Deep Learning Weekly: Issue 435
Announcing the Future of AI Engineering: Self-Optimizing Agents, Fantastic Bugs and Where to Find Them in AI Benchmarks, Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes, and more!
This week in deep learning, we bring you Announcing the Future of AI Engineering: Self-Optimizing Agents, Fantastic Bugs and Where to Find Them in AI Benchmarks, and a paper on Motif-2-12.7B-Reasoning: A Practitioner’s Guide to RL Training Recipes.
You may also enjoy Introducing SAM Audio: The First Unified Multimodal Model for Audio Separation, Letta Code: A Memory-First Coding Agent, a paper on Evaluating AI’s ability to perform scientific research tasks, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemini 3 Flash: frontier intelligence built for speed
Google announced Gemini 3 Flash, achieving 90.4% on GPQA Diamond and 78% on SWE-bench Verified while being 3x faster than Gemini 2.5 Pro at $0.50 per million input tokens.
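For readers who want to try it, here is a minimal sketch of calling the model through the google-genai Python SDK. The model identifier "gemini-3-flash" is an assumption; check the official docs for the exact released id.

```python
# Minimal sketch: query Gemini 3 Flash via the google-genai SDK.
# The model id below is an assumption, not confirmed from the announcement.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed identifier
    contents="Summarize the trade-offs between latency and reasoning depth.",
)
print(response.text)
```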
Introducing SAM Audio: The First Unified Multimodal Model for Audio Separation
Meta announces SAM Audio, the first unified multimodal model enabling intuitive audio separation through text, visual, or temporal prompts, achieving state-of-the-art performance.
NVIDIA Debuts Nemotron 3 Family of Open Models
NVIDIA launches the Nemotron-3 family of open-source AI models, offering developers new tools for building and deploying customizable language models across various applications.
Interactions API: A unified foundation for models and agents
Google launches Interactions API, featuring server-side state management, background execution, and access to Gemini Deep Research agent.
OpenSearch 3.4
OpenSearch announces version 3.4, introducing new features and improvements to the open-source search and analytics suite for enhanced performance, security, and developer experience.
New method enables small language models to solve complex reasoning tasks
MIT CSAIL researchers develop a training method that enables small language models to perform complex reasoning tasks by learning to generate internal “thought” processes, achieving comparable results to much larger models.
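To make the idea concrete, here is an illustrative sketch (not the CSAIL recipe, whose details are in the paper): fine-tune a small model on traces that wrap an explicit "thought" span before the final answer. The tag format and dataset fields are assumptions.

```python
# Illustrative sketch: build SFT data where the rationale precedes the answer,
# so a small causal LM learns to emit its reasoning before committing.
from datasets import Dataset

def format_trace(example):
    # Wrap the rationale in explicit tags; at inference time the model is
    # expected to generate the <think> span itself.
    return {"text": (
        f"Question: {example['question']}\n"
        f"<think>{example['rationale']}</think>\n"
        f"Answer: {example['answer']}"
    )}

raw = Dataset.from_list([{
    "question": "What is 17 * 6?",
    "rationale": "17 * 6 = 17 * 5 + 17 = 85 + 17 = 102.",
    "answer": "102",
}])
train_ds = raw.map(format_trace)
print(train_ds[0]["text"])
# train_ds can now feed any causal-LM SFT loop (e.g. trl's SFTTrainer).
```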
MLOps & LLMOps
Announcing the Future of AI Engineering: Self-Optimizing Agents
A blog post exploring how self-optimizing agents use continuous evaluation and feedback loops to automatically improve prompts, tools, and behaviors over time, moving beyond static agent design toward systems that learn and adapt in production.
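The core loop is easy to picture. Below is a minimal sketch of the evaluate-and-update cycle the post describes; every function is a trivial stand-in (a real system would plug in an eval harness, an optimizer model, and guardrails).

```python
# Sketch of a self-optimizing agent loop: run, score, rewrite, repeat.
def run_agent(prompt: str, task: dict) -> str:
    # Stand-in for the real agent: model calls, tool use, etc.
    return f"{prompt} | answer: {task['q'].upper()}"

def score(output: str, task: dict) -> float:
    # Stand-in evaluator: real systems use LLM judges, unit tests, or traces.
    return 1.0 if task["expected"] in output else 0.0

def propose_prompt(prompt: str, failures: list) -> str:
    # Stand-in optimizer: e.g. ask a stronger model to rewrite the prompt
    # given the concrete failure cases.
    return prompt + " Think step by step."

def self_optimize(prompt, tasks, rounds=3, threshold=0.8):
    for _ in range(rounds):
        failures = [t for t in tasks if score(run_agent(prompt, t), t) < threshold]
        if not failures:
            break  # current prompt passes the eval suite
        prompt = propose_prompt(prompt, failures)
    return prompt

tasks = [{"q": "ping", "expected": "PING"}]
print(self_optimize("You are a helpful agent.", tasks))
```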
Letta Code: A Memory-First Coding Agent
A blog post introducing Letta Code, a memory-first coding agent that ranks #1 among model-agnostic open-source harnesses on TerminalBench.
AISAQ in Milvus: Billion-Scale Vector Search Just Got 3,200× Cheaper on Memory
A technical article introducing AISAQ, a disk-based vector index achieving 3,200× memory reduction (32 GB to 10 MB) for billion-scale vector search by storing all data on SSD with optimized layouts.
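For context, here is a hedged pymilvus sketch of configuring a disk-resident vector index. AISAQ is not assumed to be an exposed pymilvus index_type yet, so the existing SSD-based DISKANN type stands in as the closest analogue to what the article describes.

```python
# Sketch: create a Milvus collection with a disk-resident vector index.
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=768)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="DISKANN",   # existing SSD-resident type; swap in the AISAQ
    metric_type="COSINE",   # type if/when Milvus exposes it
)
client.create_collection("docs", schema=schema, index_params=index_params)
```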
Learning
Fantastic Bugs and Where to Find Them in AI Benchmarks
An article introducing a measurement-theoretic framework that identifies flawed questions in AI benchmarks with up to 84% precision, detecting issues across nine widely used datasets.
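In the spirit of the article (though not its exact framework), a classic measurement-theory signal is an item whose correctness pattern disagrees with overall model ability: strong models miss it while weak ones "pass", suggesting a mislabeled or ambiguous question. A minimal sketch using item-total correlation:

```python
# Sketch: flag benchmark items with negative item-total correlation across
# many models' per-item accuracy, a standard sign of a flawed question.
import numpy as np

def flag_suspect_items(correct: np.ndarray, min_corr: float = 0.0):
    """correct: (n_models, n_items) 0/1 matrix of per-item model accuracy."""
    totals = correct.sum(axis=1)
    suspects = []
    for j in range(correct.shape[1]):
        rest = totals - correct[:, j]  # ability excluding item j
        if np.std(correct[:, j]) == 0 or np.std(rest) == 0:
            continue  # no variance, no signal for this item
        r = np.corrcoef(correct[:, j], rest)[0, 1]
        if r < min_corr:  # weak models outscore strong ones on this item
            suspects.append((j, r))
    return suspects

rng = np.random.default_rng(0)
matrix = rng.integers(0, 2, size=(20, 50))
print(flag_suspect_items(matrix))
```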
How to Build Privacy-Preserving Evaluation Benchmarks with Synthetic Data
A technical tutorial demonstrating how to build privacy-preserving AI evaluation benchmarks using NVIDIA NeMo Data Designer and NeMo Evaluator to generate synthetic datasets.
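The general pattern is worth sketching, though the snippet below is a generic illustration of the idea, not the NeMo Data Designer or NeMo Evaluator API: generate synthetic candidate items, then screen them so nothing from the protected source corpus leaks into the benchmark.

```python
# Generic sketch: filter synthetic eval items that leak protected strings.
PRIVATE_STRINGS = {"alice@example.com", "555-0134"}  # stand-in for real PII

def leaks_private_data(text: str) -> bool:
    return any(s in text for s in PRIVATE_STRINGS)

def build_benchmark(candidates: list[dict]) -> list[dict]:
    # Keep only synthetic items that contain none of the protected strings.
    return [c for c in candidates
            if not leaks_private_data(c["question"] + c["answer"])]

synthetic = [
    {"question": "What is the refund policy window?", "answer": "30 days."},
    {"question": "Email alice@example.com for access.", "answer": "ok"},
]
print(build_benchmark(synthetic))  # the second item is filtered out
```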
How to Fine-Tune an LLM on NVIDIA GPUs With Unsloth
A guide about fine-tuning LLMs using Unsloth on NVIDIA DGX Cloud and Spark, demonstrating how to customize AI models for specific tasks with improved performance and efficiency.
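For a feel of the workflow, here is a minimal Unsloth sketch. The model name, dataset, and hyperparameters are illustrative rather than the guide's exact settings, and trl's SFTTrainer arguments shift between versions, so treat this as a starting point.

```python
# Sketch: QLoRA fine-tuning with Unsloth + trl's SFTTrainer.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit base model; Unsloth patches it for faster training.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative choice
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("imdb", split="train[:1%]")  # any text dataset works

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        max_steps=60,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
    ),
)
trainer.train()
```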
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Simple, unified interface to multiple Generative AI providers
Papers & Publications
Motif-2-12.7B-Reasoning: A Practitioner’s Guide to RL Training Recipes
Abstract:
We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.
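One of the abstract's ingredients, difficulty-aware data filtering, is easy to illustrate. The sketch below is an assumption-laden reading of the idea, not the paper's exact recipe: drop prompts the current policy always solves or never solves, since both extremes carry no RL learning signal.

```python
# Sketch: keep only prompts whose empirical pass rate falls in an
# informative band, dropping all-pass and all-fail prompts.
import random

def sample(prompt):            # stand-in for a policy rollout
    return random.random()

def verify(prompt, rollout):   # stand-in for answer verification
    return rollout > prompt["difficulty"]

def filter_by_difficulty(prompts, k=16, lo=0.1, hi=0.9):
    kept = []
    for p in prompts:
        rate = sum(verify(p, sample(p)) for _ in range(k)) / k
        if lo <= rate <= hi:   # informative difficulty band
            kept.append((p["id"], rate))
    return kept

prompts = [{"id": i, "difficulty": d} for i, d in enumerate([0.05, 0.5, 0.99])]
print(filter_by_difficulty(prompts))  # trivially easy/hard prompts drop out
```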
Evaluating AI’s ability to perform scientific research tasks
Abstract:
We introduce FrontierScience, a benchmark evaluating AI capabilities for expert-level scientific reasoning. FrontierScience consists of two tracks: (1) Olympiad, which contains international olympiad problems (at the level of IPhO, IChO, and IBO), and (2) Research, which contains PhD-level, open-ended problems representative of sub-problems in scientific research. In total, FrontierScience is composed of several hundred questions (160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already published information. In contrast, all Olympiad problems are originally produced by international olympiad medalists and national team coaches to ensure standards of difficulty, originality, and factuality. All Research problems are research sub-tasks written and verified by PhD scientists (doctoral candidates, postdoctoral researchers, or professors). For Research, we also introduce a granular rubric-based architecture to evaluate model capabilities throughout the process of solving a research task, as opposed to judging a standalone answer. In initial evaluations of several frontier models, GPT-5.2 is the top performing model on FrontierScience, scoring 77% on the Olympiad set and 25% on the Research set.
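The rubric-based architecture the abstract mentions can be pictured as grading a research trace against weighted per-step criteria rather than a single final answer. The sketch below is a hedged illustration, not the paper's actual system; the judge here is a trivial stand-in for an LLM or human grader.

```python
# Sketch: score a research trace against a weighted rubric of process steps.
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str
    weight: float

def judge(trace: str, criterion: str) -> bool:
    # Stand-in: a real system would ask an LLM judge or a human expert.
    return criterion.lower() in trace.lower()

def rubric_score(trace: str, rubric: list[RubricItem]) -> float:
    earned = sum(item.weight for item in rubric if judge(trace, item.criterion))
    return earned / sum(item.weight for item in rubric)

rubric = [
    RubricItem("governing equation", 2.0),
    RubricItem("unit check", 1.0),
    RubricItem("uncertainty estimate", 1.0),
]
trace = "Derive the governing equation, run a unit check, report the mean."
print(rubric_score(trace, rubric))  # 0.75: the uncertainty estimate is missing
```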



The self-optimizing agents piece is interesting timing given the recent discussions around recursive improvement. We've seen agents get better at specific tasks through RL, but the jump to systems that improve their own training process is still mostly theoretical. The Motif-2 paper on RL training recipes is probably more actionable for most practitioners right now than the agentic work.
It's time to address these often-overlooked problems, whose consequences will only compound over time. With that in mind, I believe (though I could be wrong) that it's not simply a matter of AI "reasoning" better; rather, we are training systems that learn to correct themselves, much as the brain learns from mistakes and experience. The challenge isn't raw capability but the criteria we use to apply it. Thank you for bringing these issues to the forefront; they are already shaping how we think and make decisions.