Deep Learning Weekly: Issue 413
The General-Purpose AI Code of Practice, Stop Saying RAG Is Dead, a paper on WebDancer: Towards Autonomous Information Seeking Agency, and many more!
This week in deep learning, we bring you The General-Purpose AI Code of Practice, Stop Saying RAG Is Dead, and a paper on WebDancer: Towards Autonomous Information Seeking Agency.
You may also enjoy Mistral AI's Voxtral, Graph foundation models for relational data, a paper on Archon: An Architecture Search Framework for Inference-Time Techniques, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
The General-Purpose AI Code of Practice
The EU published the General-Purpose AI (GPAI) Code of Practice, a voluntary tool designed to help industry comply with the AI Act’s obligations.
Voxtral
The Mistral AI team introduced a family of frontier open-source speech understanding models called Voxtral.
Kiro
The Kiro team announced Kiro, an AI IDE that helps you deliver from concept to production through a simplified developer experience for working with AI agents.
Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning
Microsoft unveiled a new addition to the Phi model family: Phi-4-mini-flash-reasoning.
New AI system uncovers hidden cell subtypes, boosts precision medicine
CellLENS reveals hidden patterns in cell behavior within tissues, offering deeper insights into cell heterogeneity — vital for advancing cancer immunotherapy.
Harmonic raises $100M at nearly $900M valuation to scale AI model for formal mathematical reasoning
Harmonic AI, a startup building AI for formal mathematical reasoning, announced that it has raised $100 million in new funding to accelerate the commercialization of its flagship model.
MLOps & LLMOps
Stop Saying RAG Is Dead – Hamel’s Blog
Hamel Husain and Ben Clavie's five-part series on why RAG is not dead, covering topics from modern metrics all the way up to RAG systems with multiple representations.
Building Self-Evolving Knowledge Graphs Using Agentic Systems
An article that explores the impact of graph databases and how AI agents address the limitations of static graphs through continuous knowledge base expansion and enrichment.
Learning
Introducing FlexOlmo: a new paradigm for language model training and data collaboration
The Ai2 team introduced FlexOlmo, a new paradigm for language model training that enables co-development of AI through data collaboration.
Graph foundation models for relational data
A blog post detailing the design of graph foundation models (GFMs) that learn transferable graph representations, addressing key challenges for relational data.
Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models
The Hugging Face team announced the release of Kimina-Prover-72B, a state-of-the-art theorem-proving model built on Qwen2.5-72B and trained with the Kimi k1.5 RL pipeline.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A unified framework and environment designed to guide the development of domain-specific agents for industrial asset operations and maintenance.
Papers & Publications
WebDancer: Towards Autonomous Information Seeking Agency
Abstract:
Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in a web agent based on ReAct, WebDancer. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models.
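To make the ReAct-style loop that WebDancer builds on concrete, here is a minimal, purely illustrative Python sketch; the toy_policy and toy_search functions are hypothetical stand-ins for the trained agent and the browsing tool, not the paper's implementation.

```python
# Illustrative sketch only: a minimal ReAct-style information-seeking loop with
# stubbed-out policy and search components (hypothetical names, not from the paper).

def toy_search(query: str) -> str:
    """Stand-in for a web search / browsing tool."""
    corpus = {
        "capital of france": "Paris is the capital of France.",
        "eiffel tower height": "The Eiffel Tower is about 330 metres tall.",
    }
    return corpus.get(query.lower(), "No results found.")

def toy_policy(question: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: returns (action, argument).
    A trained agent would generate Thought/Action/Observation text instead."""
    if not history:
        return "search", question          # first step: issue a search
    return "answer", history[-1]           # then answer from the last observation

def react_loop(question: str, max_steps: int = 4) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        action, arg = toy_policy(question, history)
        if action == "answer":
            return arg
        observation = toy_search(arg)      # act, then record the observation
        history.append(observation)
    return "No answer found."

print(react_loop("capital of France"))
```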
Archon: An Architecture Search Framework for Inference-Time Techniques
Abstract:
Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI's o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.
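As a rough illustration of what an inference-time architecture search entails, here is a toy Python sketch that scores a handful of configurations (number of samples x selection strategy) under a fixed sampling budget; generate() and score() are hypothetical stand-ins for an LLM call and a benchmark metric, and this is not Archon's actual API.

```python
# Illustrative sketch only: exhaustive search over a tiny space of
# inference-time configurations under a fixed per-prompt sample budget.
import itertools
import random

random.seed(0)

def generate(prompt: str, temperature: float) -> str:
    """Stand-in for an LLM sample; returns a noisy candidate answer."""
    return f"{prompt}-cand{random.randint(0, 9)}"

def score(candidate: str) -> float:
    """Stand-in for a verifier / benchmark score."""
    return random.random()

def run_config(prompt: str, n_samples: int, strategy: str) -> float:
    candidates = [generate(prompt, temperature=0.8) for _ in range(n_samples)]
    scores = [score(c) for c in candidates]
    if strategy == "best_of_n":
        return max(scores)                 # keep the single best sample
    return sum(scores) / len(scores)       # crude "fusion": average quality

budget = 16                                # max samples we may spend per prompt
search_space = itertools.product([1, 4, 8, 16], ["best_of_n", "fusion"])
best = max(
    (cfg for cfg in search_space if cfg[0] <= budget),
    key=lambda cfg: run_config("demo-task", *cfg),
)
print("selected configuration:", best)
```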
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
Abstract:
What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training-workflow of replacing all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of accumulating synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
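The difference between the "replace" and "accumulate" training-workflows is easy to reproduce in miniature. Below is an illustrative NumPy sketch (not the paper's code) for the simplest task-setting, Gaussian estimation, assuming 50 samples per generation.

```python
# Illustrative sketch only: compare the "replace" and "accumulate" workflows
# for 1-D Gaussian estimation across generations of synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 50                                            # samples drawn per generation
real = rng.normal(loc=0.0, scale=1.0, size=n)     # generation-0 "real" data

def fit_and_sample(data, n_samples=n):
    """Fit a Gaussian (mean, std) to data and draw synthetic samples from it."""
    mu, sigma = data.mean(), data.std()
    return sigma, rng.normal(mu, sigma, size=n_samples)

replace_pool, accumulate_pool = real, real
for gen in range(1, 41):
    # replace: each generation is fit only to the previous generation's synthetic data
    sigma_r, replace_pool = fit_and_sample(replace_pool)
    # accumulate: fit to all data seen so far, then add the new synthetic batch
    sigma_a, synth = fit_and_sample(accumulate_pool)
    accumulate_pool = np.concatenate([accumulate_pool, synth])
    if gen % 10 == 0:
        print(f"gen {gen:2d}  replace sigma={sigma_r:.3f}  accumulate sigma={sigma_a:.3f}")
# The "replace" sigma tends to shrink across generations (model collapse),
# while the "accumulate" sigma stays close to the true value of 1.
```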