Deep Learning Weekly: Issue 392
LLM Monitoring & Maintenance in Production Applications, Using AI to decode language from the brain and advance our understanding of human communication, and much more!
This week in deep learning, we bring you LLM Monitoring & Maintenance in Production Applications, Using AI to decode language from the brain and advance our understanding of human communication, and a paper on Reinforcement Learning for Long-Horizon Interactive LLM Agents.
You may also enjoy Nathan Lambert’s RLHF Book, Automating the Search for Artificial Life with Foundation Models, a paper on s1: Simple test-time scaling, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Using AI to decode language from the brain and advance our understanding of human communication
In collaboration with the Basque Center on Cognition, Brain and Language, Meta shares two breakthroughs that show how AI can help advance our understanding of human intelligence, leading us closer to advanced machine intelligence.
User-friendly system can help developers build more efficient simulations and AI models
To improve the efficiency of AI models, MIT researchers created an automated system that enables developers of deep learning algorithms to take advantage of two types of data redundancy simultaneously.
MLOps & LLMOps
LLM Monitoring & Maintenance in Production Applications
An article exploring how teams use new LLM monitoring tools to track, measure, and optimize their applications over time in production.
Introducing AgentWorkflow: A Powerful System for Building AI Agent Systems
LlamaIndex introduced AgentWorkflow, a new system that makes it easy to build and orchestrate AI agent systems.
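For a rough sense of what this looks like in code, here is a minimal sketch assuming the llama-index AgentWorkflow API with an OpenAI model; the tool, model name, and prompt are placeholders for illustration, and the linked post is the authoritative reference.

```python
import asyncio

from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.llms.openai import OpenAI


def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b


# Single-agent workflow built directly from a plain Python function used as a tool.
workflow = AgentWorkflow.from_tools_or_functions(
    [multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can multiply numbers.",
)


async def main():
    response = await workflow.run(user_msg="What is 3.1 * 4.2?")
    print(response)


asyncio.run(main())
```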
Learning
RLHF Book
Nathan Lambert’s short introductory book on RLHF and post-training, focused on language models.
From PDFs to Insights: Structured Outputs from PDFs with Gemini 2.0
A tutorial on how to extract structured information, like invoice numbers and dates, directly from your PDF documents using Gemini 2.0.
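As a rough sketch of the approach (assuming the google-genai Python SDK; the tutorial itself may structure things differently), you can pass the PDF bytes and a Pydantic schema to Gemini 2.0 and get typed fields back. The schema fields, file name, and prompt below are illustrative assumptions.

```python
from google import genai
from google.genai import types
from pydantic import BaseModel


# Illustrative schema; real invoices will need more fields.
class Invoice(BaseModel):
    invoice_number: str
    invoice_date: str
    total_amount: float


client = genai.Client(api_key="YOUR_API_KEY")
pdf_bytes = open("invoice.pdf", "rb").read()  # placeholder file name

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        "Extract the invoice number, date, and total amount from this document.",
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Invoice,
    ),
)
print(response.parsed)  # an Invoice instance parsed from the model's JSON output
```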
A blog post covering what hybrid search is, the roles of sparse and dense vectors, and when to use hybrid search.
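One common way hybrid search is implemented, sketched below purely for illustration (the post may cover other fusion schemes), is to run a sparse and a dense retriever separately and merge their rankings with reciprocal rank fusion; the document ids and rankings here are made up.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids (best first)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


sparse_hits = ["doc3", "doc1", "doc7"]  # e.g. BM25 over keywords
dense_hits = ["doc1", "doc5", "doc3"]   # e.g. cosine similarity over embeddings
print(rrf([sparse_hits, dense_hits]))   # fused, hybrid ordering
```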
Automating the Search for Artificial Life with Foundation Models
Sakana AI highlights a new algorithm called Automated Search for Artificial Life (“ASAL”) to automate the discovery of artificial life using vision-language foundation models.
Libraries & Code
Transform data and create rich visualizations iteratively with AI.
Finetune Llama 3.3, DeepSeek-R1 & Reasoning LLMs 2x faster with 70% less memory.
Papers & Publications
s1: Simple test-time scaling

Abstract:
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24.
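For a concrete sense of what budget forcing means, here is a minimal, illustrative sketch (not the authors' code): a small stand-in model reasons between assumed <think>/</think> delimiters, "Wait" is appended whenever the model tries to stop early, and the thinking phase is forcefully closed once the budget is used. The model name, delimiters, budget, and prompt are assumptions for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in; s1 fine-tunes Qwen2.5-32B-Instruct
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)


def generate_until(text, stop, max_new_tokens=256):
    """Greedy-decode a continuation, truncated at `stop` if it appears."""
    ids = tok(text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    new = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return new.split(stop)[0], (stop in new)


prompt = ("Question: What is 17 * 24? "
          "Think step by step between <think> and </think>.\n<think>\n")
max_wait_rounds = 2  # how many times to extend thinking before forcing an answer
trace = ""
for _ in range(max_wait_rounds):
    chunk, ended = generate_until(prompt + trace, stop="</think>")
    trace += chunk
    if not ended:
        break          # the per-round token budget ran out on its own
    trace += "\nWait"  # the model tried to stop: append "Wait" to lengthen thinking

# Forcefully terminate the thinking phase and ask for the final answer.
answer, _ = generate_until(prompt + trace + "\n</think>\nFinal answer:", stop="\n")
print(answer.strip())
```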
Reinforcement Learning for Long-Horizon Interactive LLM Agents
Abstract:
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction-tuned large language models (LLMs) can react to feedback from interface invocations in multi-step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov decision process and derive LOOP, a data- and memory-efficient variant of proximal policy optimization. LOOP uses no value network and maintains exactly one copy of the underlying LLM in memory, making its implementation straightforward and as memory-efficient as fine-tuning a single LLM. A 32-billion-parameter agent trained with LOOP in the AppWorld environment outperforms the much larger OpenAI o1 agent by 9 percentage points (15% relative). To our knowledge, this is the first reported application of RL to IDAs that interact with a stateful, multi-domain, multi-app environment via direct API calls. Our analysis sheds light on the effectiveness of RL in this area, showing that the agent learns to consult the API documentation, avoid unwarranted assumptions, minimize confabulation, and recover from setbacks.
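The abstract does not spell out LOOP's exact objective, but its stated properties (a PPO-style update with no value network and a single LLM copy) are consistent with replacing the learned critic by a leave-one-out baseline over several rollouts of the same task. Below is a minimal, illustrative PyTorch sketch of such a clipped policy loss; the function name, the number of rollouts, and the toy numbers are assumptions, not the paper's implementation.

```python
import torch


def clipped_policy_loss(logp_new, logp_old, rewards, eps=0.2):
    """
    logp_new, logp_old: (K,) summed log-probs of K rollouts of one task under the
                        current and the behavior policy, respectively.
    rewards:            (K,) scalar task rewards for those rollouts.
    """
    K = rewards.shape[0]
    # Leave-one-out baseline: each rollout is compared to the mean of the others,
    # so no separate value network is needed.
    baseline = (rewards.sum() - rewards) / (K - 1)
    adv = rewards - baseline
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()


# Toy usage with made-up numbers (K = 4 rollouts of one task):
logp_old = torch.tensor([-35.0, -40.0, -32.0, -38.0])
logp_new = logp_old + torch.tensor([0.1, -0.2, 0.05, 0.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(clipped_policy_loss(logp_new, logp_old, rewards))
```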