Deep Learning Weekly: Issue 408
Mistral's Magistral, Benchmarking Multi-Agent Architectures, a paper on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models, and many more!
This week in deep learning, we bring you Mistral's Magistral, Benchmarking Multi-Agent Architectures, and a paper on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.
You may also enjoy Meta's V-JEPA 2, KV Cache from scratch in nanoVLM, a paper on Learning to Reason without External Rewards, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
The Mistral team announced Magistral, their first reasoning model, which excels at domain-specific, transparent, and multilingual reasoning.
Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning
Meta introduces V-JEPA 2, the first world model trained on video that enables state-of-the-art understanding and prediction, as well as zero-shot planning and robot control in new environments.
Glean nabs $150M in funding at $7.2B valuation
Glean, an enterprise search startup, announced that it has raised $150 million in late-stage funding.
MLOps & LLMOps
Benchmarking Multi-Agent Architectures
A blog post that explores a few common multi-agent architectures, discusses both the motivations and constraints of different architectures, and benchmarks their performance on a variant of the Tau-bench dataset.
Building a Production Multimodal Fine-Tuning Pipeline
A guide that demonstrates how to overcome the multimodal implementation gap, with a hands-on example fine-tuning Gemma 3 on the SIIM-ISIC Melanoma dataset.
Summarize Hacker News Posts with Haystack & OPEA
A step-by-step tutorial for building a RAG pipeline that fetches live Hacker News posts and summarizes them with a local LLM endpoint.
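For flavor, here is a minimal plain-Python sketch of the same idea — fetch top Hacker News stories and summarize each with an OpenAI-compatible local endpoint. This is not the Haystack/OPEA pipeline from the tutorial; the endpoint URL and model name are placeholders.

```python
# Minimal sketch: fetch top Hacker News stories via the public Firebase API and
# summarize each with an OpenAI-compatible local LLM endpoint (URL and model
# name are assumptions, swap in your own).
import requests

HN_API = "https://hacker-news.firebaseio.com/v0"
LLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local endpoint

def top_stories(n=5):
    ids = requests.get(f"{HN_API}/topstories.json", timeout=10).json()[:n]
    return [requests.get(f"{HN_API}/item/{i}.json", timeout=10).json() for i in ids]

def summarize(story):
    prompt = (
        "Summarize this Hacker News post in two sentences:\n"
        f"{story['title']}\n{story.get('url', '')}"
    )
    resp = requests.post(LLM_URL, json={
        "model": "local-model",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=60)
    return resp.json()["choices"][0]["message"]["content"]

for story in top_stories():
    print(story["title"], "->", summarize(story))
```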
Learning
More efficient multi-vector embeddings with MUVERA
A detailed blog post on MUVERA, an encoding algorithm that turns multi-vector embeddings into single fixed-dimensional vectors for fast retrieval.
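Below is a highly simplified sketch of the fixed-dimensional-encoding (FDE) idea behind MUVERA, as I understand it: partition the embedding space with random hyperplanes (SimHash), pool each side's token vectors per partition, and concatenate, so a single dot product between query and document FDEs approximates the multi-vector MaxSim/Chamfer score. Repetitions, empty-bucket filling, and the final projection are omitted; all parameter names are illustrative.

```python
# Simplified MUVERA-style fixed-dimensional encoding (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PLANES = 128, 4                        # token dim; 2**4 = 16 buckets
PLANES = rng.standard_normal((N_PLANES, DIM))

def bucket_ids(vectors):
    # SimHash: the sign pattern against random hyperplanes picks a bucket.
    bits = (vectors @ PLANES.T) > 0
    return bits.astype(int) @ (1 << np.arange(N_PLANES))

def fde(vectors, is_query):
    out = np.zeros((2 ** N_PLANES, DIM))
    ids = bucket_ids(vectors)
    for b in range(2 ** N_PLANES):
        members = vectors[ids == b]
        if len(members):
            # queries sum per bucket, documents average per bucket
            out[b] = members.sum(0) if is_query else members.mean(0)
    return out.ravel()

query_vecs = rng.standard_normal((32, DIM))   # e.g. ColBERT-style token vectors
doc_vecs = rng.standard_normal((180, DIM))
score = fde(query_vecs, True) @ fde(doc_vecs, False)  # one MIPS-style dot product
```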
There Are Only 6 RAG Evals - Jason Liu
A blog post simplifying RAG evaluation into six core relationships between question, context, and answer.
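As an illustration of that framing (not Jason Liu's exact definitions or prompts — see the post for those), three artifacts give six directed relations, each of which can be scored independently. The sketch below uses a trivial word-overlap stand-in where an LLM judge would normally go.

```python
# Illustrative skeleton: score each directed relation among (question, context,
# answer). overlap_judge() is a toy stand-in for an LLM judge.
from itertools import permutations

def overlap_judge(a_text: str, b_text: str) -> float:
    # Toy proxy: fraction of b's words that appear in a. Replace with an LLM judge.
    b_words = set(b_text.lower().split())
    return len(b_words & set(a_text.lower().split())) / max(len(b_words), 1)

def rag_evals(question: str, context: str, answer: str) -> dict:
    parts = {"question": question, "context": context, "answer": answer}
    # 3 artifacts -> 6 ordered pairs, e.g. context->question (is the retrieved
    # context relevant to the question?) or answer->context (is the answer
    # grounded in the context?).
    return {f"{a}->{b}": overlap_judge(parts[a], parts[b])
            for a, b in permutations(parts, 2)}

print(rag_evals("Who wrote Dune?",
                "Dune is a novel by Frank Herbert.",
                "Frank Herbert wrote Dune."))
```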
GitHub MCP Exploited: Accessing private repositories via MCP
A blog post showcasing a vulnerability in GitHub MCP that allows attackers to access private repository data via toxic agent flows.
KV Cache from scratch in nanoVLM
An article explaining the implementation of KV Caching from scratch in nanoVLM for faster autoregressive language model generation.
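The core trick is small enough to show in a few lines. This is a minimal single-head sketch of the idea (illustrative, not nanoVLM's actual code): keys and values for past tokens are cached, so each decoding step only computes projections and attention for the newest token.

```python
# Minimal KV-cache sketch for single-head autoregressive decoding.
import torch

def attend(q, k, v):
    # q: (1, d); k, v: (t, d) -> scaled dot-product attention over all cached steps
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x):
    # x: (1, d) embedding of the newly generated token
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)            # append new K/V instead of recomputing the past
    v_cache.append(v)
    return attend(q, torch.cat(k_cache), torch.cat(v_cache))

for _ in range(5):               # each step costs O(t) attention, not O(t^2)
    out = decode_step(torch.randn(1, d))
```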
Libraries & Code
Fully open data curation for reasoning models.
Papers & Publications
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Abstract:
Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities.
Learning to Reason without External Rewards
Abstract:
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.
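To make the reward concrete, here is a hedged sketch of one natural formalization of self-certainty (my paraphrase; the paper has the exact definition): the KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, averaged over generated tokens, so peaked (confident) predictions score higher than flat ones.

```python
# Sketch of a self-certainty-style intrinsic reward computed from logits.
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) for the tokens the model generated
    log_p = F.log_softmax(logits, dim=-1)
    vocab = logits.shape[-1]
    # KL(U || p) per position = -log(V) - mean_v log p(v)
    kl_per_token = -math.log(vocab) - log_p.mean(dim=-1)
    return kl_per_token.mean()   # scalar used in place of a verifier reward

confident = torch.zeros(4, 1000); confident[:, 0] = 10.0   # peaked predictions
flat = torch.zeros(4, 1000)                                 # uniform predictions
assert self_certainty(confident) > self_certainty(flat)
```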
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Abstract:
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.