Deep Learning Weekly: Issue 375
OpenAI's MLE-bench, Log Datasets & Evaluate LLM Performance with Opik, An Opinionated Evals Reading List, a paper on Pyramidal Flow Matching for Efficient Video Generative Modeling, and many more!
This week in deep learning, we bring you OpenAI's MLE-bench, Log Datasets & Evaluate LLM Performance with Opik, An Opinionated Evals Reading List, and a paper on Pyramidal Flow Matching for Efficient Video Generative Modeling.
You may also enjoy Basecamp Research draws $60M to build a 'GPT for biology', How Shopify improved consumer search intent with real-time ML, a paper on F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
OpenAI introduced MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.
Basecamp Research draws $60M to build a 'GPT for biology'
Basecamp Research has raised $60 million to build an AI agent that not only answers questions about biology and biodiversity, but also produces new insights that humans could not reach on their own.
HuggingFace releases Gradio 5
HuggingFace announced the stable release of Gradio 5, which includes an AI playground and support for low-latency streaming.
AI observability firm Galileo raises $45M to improve AI model accuracy
Galileo, an enterprise AI observability and evaluation platform provider, announced that it has raised $45 million in new funding.
MLOps & LLMOps
OpenAI Evals: Log Datasets & Evaluate LLM Performance with Opik
A technical blog post demonstrating how to use OpenAI Evals and the Opik platform to log datasets and evaluate LLM performance.
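As a rough sketch of the kind of workflow the post walks through (assuming the Opik Python SDK's dataset and evaluation interfaces; the dataset contents, model choice, and metric here are illustrative, not the post's exact setup):

from openai import OpenAI
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

openai_client = OpenAI()

def run_llm(question: str) -> str:
    # Illustrative OpenAI call; any model or provider could be substituted
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Log a small evaluation dataset to Opik
client = Opik()
dataset = client.get_or_create_dataset(name="support-questions")
dataset.insert([
    {"input": "How do I reset my password?"},
    {"input": "Can I export my data as CSV?"},
])

# Run each dataset item through the model and score the outputs
def evaluation_task(item: dict) -> dict:
    return {"input": item["input"], "output": run_llm(item["input"])}

evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
)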
How Shopify improved consumer search intent with real-time ML
Google’s blog post about how Shopify used real-time machine learning (ML) and embedding pipelines to better capture consumer search intent.
Ray Batch Inference at Pinterest
A blog post about how Pinterest Engineering uses Ray to perform batch inference for their ML models, including large language models (LLMs).
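For flavor, a minimal Ray Data batch-inference sketch in the same spirit (the paths, model, batch size, and actor pool sizing are hypothetical, not Pinterest's actual pipeline):

import numpy as np
import ray
from transformers import pipeline

class BatchPredictor:
    def __init__(self):
        # Each actor loads the model once and reuses it across batches
        self.model = pipeline("sentiment-analysis", device=0)

    def __call__(self, batch: dict) -> dict:
        predictions = self.model(list(batch["text"]))
        batch["label"] = np.array([p["label"] for p in predictions])
        return batch

ds = ray.data.read_parquet("s3://example-bucket/inputs/")
ds = ds.map_batches(BatchPredictor, batch_size=64, num_gpus=1, concurrency=4)
ds.write_parquet("s3://example-bucket/predictions/")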
How Salesforce Builds Reproducible Red Teaming Infrastructure
Salesforce discusses how they build reproducible infrastructure for red teaming AI models.
Scaling RAG from POC to Production
An article outlining the challenges and architectural components for scaling RAG from proof-of-concept (POC) to production.
Learning
The AI Developer’s Dilemma: Proprietary AI vs. Open Source Ecosystem
An article arguing that AI developers should use smaller, targeted AI models instead of larger, general-purpose models.
Leveraging Mechanistic Interpretability for Red-Teaming: Haize Labs x Goodfire
A technical blog post discussing how to leverage mechanistic interpretability tools for red teaming LLMs.
Best Prompt Techniques for Best LLM Responses
An informative article with tips and best practices for prompt engineering with LLMs.
An Opinionated Evals Reading List
An opinionated reading list for learning how to evaluate large language models.
Libraries & Code
Software design principles for machine learning applications.
NannyML/The-Little-Book-of-ML-Metrics
The open-source repository of The Little Book of ML Metrics, a reference on machine learning evaluation metrics for data scientists.
Papers & Publications
Pyramidal Flow Matching for Efficient Video Generative Modeling
Abstract:
Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours.
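For reference, the standard (conditional) flow matching objective that the pyramidal stages reinterpret looks like the following; this is the generic formulation, not the paper's stage-wise pyramidal variant:

\[
x_t = (1-t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big\|\, v_\theta(x_t, t) - (x_1 - x_0) \,\big\|^2,
\]

where \(x_0\) is Gaussian noise, \(x_1\) is the clean (video) sample, and \(v_\theta\) is the learned velocity field; the pyramidal approach splits this single trajectory into stages at increasing resolutions, with only the final stage operating at full resolution.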
Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient
Abstract:
We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these refined LLCs (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize to process, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for developmental interpretability, which aims to understand models through their evolution across the learning process. More broadly, this work takes a step towards establishing the correspondence between data distributional structure, geometric properties of the loss landscape, learning dynamics, and emergent computational structures in neural networks.
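For orientation, the (unrefined) local learning coefficient is typically estimated from a tempered posterior localized around the trained weights; a sketch of the standard estimator, with the refinements restricting which weights and which data distribution enter it:

\[
\hat{\lambda}(w^{*}) = n\beta \,\Big( \mathbb{E}_{w \sim p_{\beta,\gamma}(\cdot \mid w^{*})}\big[ L_n(w) \big] - L_n(w^{*}) \Big),
\]

where \(L_n\) is the empirical loss on \(n\) samples, \(\beta\) is an inverse temperature, and \(p_{\beta,\gamma}\) is a posterior localized at \(w^{*}\) (in practice sampled with SGLD); refined LLCs restrict the sampled weights to a single component (e.g., one attention head) and/or compute \(L_n\) over a chosen data distribution.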
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Abstract:
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency.