Deep Learning Weekly: Issue 367
Fine-tuning now available for GPT-4o, Advanced RAG: Query Expansion, Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge), a paper on LongWriter, and many more!
This week in deep learning, we bring you Fine-tuning now available for GPT-4o, Advanced RAG: Query Expansion, Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge), and a paper on LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs.
You may also enjoy HALVA: Hallucination Attenuated Language and Vision Assistant, New LLM Pre-training and Post-training Paradigms, a paper on Bilateral Reference for High-Resolution Dichotomous Image Segmentation, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Fine-tuning now available for GPT-4o
OpenAI launched fine-tuning for GPT-4o, one of the most requested features from developers.
Prompt caching with Claude
Prompt caching, which enables developers to cache frequently used context between API calls, is now available on the Anthropic API.
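As a rough illustration of how a developer might mark reusable context for caching, here is a hedged sketch that only assembles a Messages API payload as a plain dict (no network call); the model name and the `build_cached_request` helper are illustrative, not from the announcement.

```python
# Hypothetical sketch: marking a large, reusable context block for caching
# with the Anthropic API's cache_control field, so follow-up calls that
# share the same prefix can reuse it instead of reprocessing it each time.

def build_cached_request(big_context: str, question: str) -> dict:
    """Assemble a Messages API-style payload with a cacheable system block."""
    return {
        "model": "claude-3-5-sonnet-20240620",  # illustrative model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": big_context,
                # Ephemeral cache: the prefix is stored briefly server-side
                # and reused by subsequent calls with an identical prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

payload = build_cached_request(
    "<long document or codebase pasted here>",
    "Summarize the main module.",
)
```

The caching marker lives on the large, stable part of the prompt (the system block), while the per-call question stays outside it, which is what makes the cached prefix reusable across requests.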
Salesforce releases xGen-MM to advance visual language understanding
Salesforce has released a new suite of large, open-source multimodal models called xGen-MM (also known as BLIP-3).
OpenAI agrees content licensing deal with Condé Nast to feed SearchGPT and ChatGPT
OpenAI has struck a deal with Condé Nast for ChatGPT and SearchGPT to be able to access content from publications including The New Yorker, Vogue, Condé Nast Traveler, Architectural Digest, GQ, Vanity Fair and Wired.
HALVA: Hallucination Attenuated Language and Vision Assistant
Researchers from Google introduce a contrastive tuning method that can be applied to off-the-shelf MLLMs to mitigate hallucinations, while preserving their general vision-language capabilities.
MLOps & LLMOps
Advanced RAG: Query Expansion
An article that walks you through how to expand keyword queries to improve recall and provide more context to a RAG system.
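To make the idea concrete, here is a toy sketch of keyword-style query expansion (the synonym table and documents are invented for illustration; the article itself would typically use an LLM or thesaurus to generate expansions):

```python
# Toy query expansion for retrieval: expanding the user's keywords before
# matching surfaces documents that the literal query would miss,
# improving recall.

DOCS = [
    "Our refund policy allows returns within 30 days.",
    "Reimbursements are processed in 5 business days.",
    "Shipping is free on orders over $50.",
]

# Hypothetical expansion table; in practice an LLM would supply this.
SYNONYMS = {"refund": ["refund", "reimbursement", "return"]}

def expand(query: str) -> set:
    """Replace each query word with its expansion set (or itself)."""
    terms = set()
    for word in query.lower().split():
        terms.update(SYNONYMS.get(word, [word]))
    return terms

def retrieve(query: str, docs=DOCS) -> list:
    """Return every document containing any expanded query term."""
    terms = expand(query)
    return [d for d in docs if any(t in d.lower() for t in terms)]
```

With the literal query "refund" only the first document matches; after expansion the reimbursement document is recalled as well, which is the extra context a RAG system benefits from.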
Implementing ‘From Local to Global’ GraphRAG with Neo4j and LangChain
A comprehensive tutorial on combining text extraction, network analysis, and LLM prompting for improved RAG accuracy.
Locally running RAG pipeline with Verba and Llama3 with Ollama
A post that explores various ways to run Verba, from using Weaviate Cloud to connecting a local Weaviate instance to Ollama.
Learning
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
Eugene Yan’s article covers key considerations, use cases, and techniques for using LLM-evaluators, as well as critiques and support for their adoption.
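One technique the article discusses is countering position bias in pairwise LLM-as-judge comparisons. Here is a hedged sketch of that idea, with an invented prompt template and a caller-supplied `call_llm` function standing in for any real model call:

```python
# Illustrative pairwise LLM-as-judge sketch. A known pitfall is position
# bias (judges favoring the first answer), so each pair is judged twice
# with the answer order swapped; only consistent verdicts count as a win.

JUDGE_TEMPLATE = (
    "You are evaluating two answers to the same question.\n"
    "Question: {question}\n"
    "Answer A: {a}\n"
    "Answer B: {b}\n"
    "Reply with exactly 'A' or 'B' for the better answer."
)

def judge_with_swap(question, ans1, ans2, call_llm):
    """call_llm: any function mapping a prompt string to 'A' or 'B'."""
    first = call_llm(JUDGE_TEMPLATE.format(question=question, a=ans1, b=ans2))
    second = call_llm(JUDGE_TEMPLATE.format(question=question, a=ans2, b=ans1))
    if first == "A" and second == "B":
        return "answer1"  # ans1 preferred in both orders
    if first == "B" and second == "A":
        return "answer2"  # ans2 preferred in both orders
    return "tie"  # inconsistent verdicts suggest position bias
```

A judge that always answers "A" regardless of content produces a tie under this scheme, which is exactly the failure the swap is designed to expose.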
New LLM Pre-training and Post-training Paradigms
Sebastian Raschka delves into the pre-training and post-training pipelines of the most recent state-of-the-art models.
A Fresh Look at Nonlinearity in Deep Learning
An article that extends our traditional explanations of nonlinearity in deep learning.
How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model
A post that discusses best practices for pruning and distillation, demonstrating their effectiveness by deriving a Llama-3.1-Minitron 4B model from the Llama 3.1 8B model.
Libraries & Code
A multimodal agent framework for solving complex tasks.
An LLM-based multi-agent framework for a web search engine (similar to Perplexity.ai Pro and SearchGPT).
Papers & Publications
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
Abstract:
Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that a model's effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, its output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT examples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long context LLMs already possess the potential for a larger output window--all you need is data with extended output during model alignment to unlock this capability.
Can You Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Abstract:
Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-banks as inputs, and thus, training them on self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include information required for a downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the design of the downstream model for speaker verification using self-supervised features. We show that we can simplify the model to use 97.51% fewer parameters while achieving a 29.93% average improvement in performance on SUPERB. Consequently, we show that the simplified downstream model is more data efficient compared to the baseline--it achieves better performance with only 60% of the training data.
Bilateral Reference for High-Resolution Dichotomous Image Segmentation
Abstract:
We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). The LM aids in object localization using global semantic information. Within the RM, we utilize BiRef for the reconstruction process, where hierarchical patches of images provide the source reference and gradient maps serve as the target reference. These components collaborate to generate the final predicted maps. We also introduce auxiliary gradient supervision to enhance focus on regions with finer details. Furthermore, we outline practical training strategies tailored for DIS to improve map quality and the training process. To validate the general applicability of our approach, we conduct extensive experiments on four tasks to evince that BiRefNet exhibits remarkable performance, outperforming task-specific cutting-edge methods across all benchmarks.