Deep Learning Weekly: Issue 440
Terminally online Mistral Vibe, ATLAS: Practical scaling laws for multilingual models, a paper on GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization, and more!
This week in deep learning, we bring you Terminally online Mistral Vibe, ATLAS: Practical scaling laws for multilingual models, and a paper on GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization.
You may also enjoy Moonshot AI releases open-source Kimi K2.5 model with 1T parameters, The AI Evolution of Graph Search at Netflix From Structured Queries to Natural Language, a paper on Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Terminally online Mistral Vibe
Mistral launches Vibe 2.0, a terminal-native coding agent powered by Devstral 2, featuring custom subagents, multi-choice clarifications, and slash-command skills.
Moonshot AI releases open-source Kimi K2.5 model with 1T parameters
Moonshot AI releases open-source Kimi K2.5, a 1 trillion parameter mixture-of-experts model trained on 15 trillion tokens that outperforms GPT-5.2 on several benchmarks including the challenging HLE-Full evaluation.
Node-based design tool Flora raises $42M from Redpoint Ventures
Flora, an AI-powered design platform, raises $42M Series A led by Redpoint Ventures to democratize creative workflows through multimodal generative AI and infinite canvas collaboration.
Amp launches “deep” mode powered by GPT-5.2-Codex, a highly autonomous coding agent that silently researches codebases for 5-15 minutes before making changes, complementing their interactive “smart” mode for different workflow needs.
NVIDIA launches Earth-2 family of open weather AI models—the world’s first fully open, accelerated weather forecasting stack—offering models for 15-day global forecasts, local storm prediction, and data assimilation that run up to 500x faster than traditional physics-based methods.
Open Coding Agents: Fast, accessible coding agents that adapt to any repo | Ai2
Allen Institute for AI launches Open Coding Agents featuring SERA, an open-source coding agent, enabling repository-specific specialization where 32B models match 100B+ teachers on private codebases.
MLOps & LLMOps
The AI Evolution of Graph Search at Netflix: From Structured Queries to Natural Language
A technical blog post detailing Netflix’s implementation of LLM-powered natural language search for their Graph Search platform, transforming structured GraphQL queries into intuitive text-based interfaces for enterprise data discovery.
Learning
ATLAS: Practical scaling laws for multilingual models
Google Research introduces ATLAS (Adaptive Transfer Scaling Laws), the largest public multilingual pre-training study with 774 training runs across 400+ languages.
Arcee AI | Trinity Large: An Open 400B Sparse MoE Model
A technical deep-dive on Arcee AI’s Trinity Large, a 400B parameter sparse MoE model with 13B active parameters achieving frontier-class performance at 2-3x faster inference than peers, trained in 33 days for $20M total cost.
AI open models have benefits. So why aren’t they more widely used?
A research article examining why open AI models, despite achieving 90% of closed-model performance at 87% lower cost, account for only 20% of usage while closed models dominate most of the market.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Efficient Lifelong Memory for LLM Agents
Papers & Publications
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Abstract:
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
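To make the normalization difference concrete, here is a minimal sketch, assuming standard group mean/std normalization and a simple sum for aggregating the per-reward advantages (the paper's exact aggregation and weighting may differ). Under GRPO-style joint normalization, two rollouts whose per-criterion rewards happen to sum to the same value receive identical advantages, while decoupled per-reward normalization keeps them distinguishable.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style: sum the per-criterion rewards for each rollout, then
    normalize the combined reward within the group. Rollouts whose rewards
    sum to the same value collapse to identical advantages."""
    combined = rewards.sum(axis=1)                                  # (num_rollouts,)
    return (combined - combined.mean()) / (combined.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO-style sketch: normalize each reward dimension separately within
    the group, then aggregate the per-reward advantages (here by summing),
    preserving relative differences that joint normalization erases."""
    per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return per_reward.sum(axis=1)

# Toy group of 4 rollouts scored on two criteria, e.g. correctness and format.
rewards = np.array([
    [1.0, 0.2],
    [0.0, 1.0],   # sums to 1.0 ...
    [1.0, 1.0],
    [1.0, 0.0],   # ... as does this rollout, despite a different reward profile
])
print("GRPO advantages:", grpo_advantages(rewards))  # rollouts 2 and 4 are identical
print("GDPO advantages:", gdpo_advantages(rewards))  # rollouts 2 and 4 differ
```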
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Abstract:
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
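The cluster-based reweighting is simple enough to sketch. The snippet below is illustrative only: the rollout dictionaries and their `cluster`/`correct` fields are hypothetical stand-ins for the LLM judge's strategy clustering and a correctness check, and the group centering mirrors a GRPO-style baseline rather than the paper's exact objective. The point it shows is that a correct rollout using a rare strategy ends up with a larger advantage than correct rollouts that repeat a dominant one.

```python
from collections import Counter

def uniqueness_weighted_advantages(rollouts):
    """Weight correct rollouts inversely by the size of their strategy
    cluster, then center within the group (GRPO-style baseline)."""
    # Count how many rollouts landed in each judge-assigned strategy cluster.
    cluster_sizes = Counter(r["cluster"] for r in rollouts)
    advantages = []
    for r in rollouts:
        base = 1.0 if r["correct"] else 0.0           # base reward: correctness only
        weight = 1.0 / cluster_sizes[r["cluster"]]    # rarer strategy -> larger weight
        advantages.append(base * weight)
    mean = sum(advantages) / len(advantages)
    return [a - mean for a in advantages]

# Toy example: four correct rollouts, three sharing one strategy and one unique.
rollouts = [
    {"cluster": "induction", "correct": True},
    {"cluster": "induction", "correct": True},
    {"cluster": "induction", "correct": True},
    {"cluster": "generating_function", "correct": True},
]
print(uniqueness_weighted_advantages(rollouts))  # the unique strategy gets the largest advantage
```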