Deep Learning Weekly: Issue 404
Sakana AI’s Continuous Thought Machines, Sleep-time Compute, a paper on Absolute Zero: Reinforced Self-play Reasoning with Zero Data, and many more!
This week in deep learning, we bring you Sakana AI's Continuous Thought Machines, which aim to make models reason with less step-by-step guidance, Sleep-time Compute, and a paper on Absolute Zero: Reinforced Self-play Reasoning with Zero Data.
You may also enjoy Mistral Medium 3, MetaShuffling: Accelerating Llama 4 MoE Inference, a paper on Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
'Continuous Thought Machines' to make models reason with less guidance — like human brains
Tokyo-based AI startup Sakana AI has unveiled a new type of model architecture called Continuous Thought Machines (CTM), which uses the timing and synchronization of neuron activity, rather than static activations, as the basis of its reasoning.
Medium is the new large. | Mistral AI
The Mistral team announced Mistral Medium 3, a new class of models that delivers state-of-the-art performance at 8x lower cost with simplified enterprise deployments.
Massive Foundation Model for Biomolecular Sciences Now Available via NVIDIA BioNeMo
Scientists everywhere can now access Evo 2, a powerful new foundation model that understands the genetic code for all domains of life.
AI startup Zencoder launches coding agent platform
Zencoder, officially For Good AI Inc., introduced a cloud platform called Zen Agents that can be used to create coding-optimized AI agents.
MLOps & LLMOps
LlamaIndex agentic workflows: Deep Research code-along
A notebook that walks through building agentic workflows, culminating in building a Deep Research workflow.
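For readers who want to peek at the plumbing before opening the notebook, here is a minimal sketch of LlamaIndex's event-driven Workflow primitives. The step bodies and names (DeepResearchFlow, ResearchEvent) are illustrative stubs, not the notebook's actual research logic:

```python
import asyncio

from llama_index.core.workflow import (
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)

class ResearchEvent(Event):
    query: str

class DeepResearchFlow(Workflow):
    @step
    async def plan(self, ev: StartEvent) -> ResearchEvent:
        # Turn the user's topic into a research query (stubbed here;
        # the notebook uses an LLM for this step).
        return ResearchEvent(query=f"background on: {ev.topic}")

    @step
    async def research(self, ev: ResearchEvent) -> StopEvent:
        # A real workflow would call retrievers and LLMs here.
        return StopEvent(result=f"notes for '{ev.query}'")

async def main():
    flow = DeepResearchFlow(timeout=60)
    print(await flow.run(topic="continuous thought machines"))

if __name__ == "__main__":
    asyncio.run(main())
```

Each `@step` consumes one event type and emits the next, so the workflow graph is inferred from the type annotations rather than declared by hand.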
How to think about agent frameworks
An insightful blog post analyzing different AI agent frameworks, contrasting agents and workflows, and highlighting challenges in building reliable agentic systems.
How to Build an MCP Server with Gradio
A guide that shows you how to use Gradio to build an MCP server in just a few lines of Python.
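The pattern the guide describes is essentially this: wrap a typed, docstring-annotated Python function in a Gradio app and enable the MCP server flag at launch. A minimal sketch, assuming a recent Gradio release with MCP support installed (pip install "gradio[mcp]"):

```python
import gradio as gr

def letter_counter(word: str, letter: str) -> int:
    """Count how many times a letter appears in a word."""
    return word.lower().count(letter.lower())

demo = gr.Interface(
    fn=letter_counter,
    inputs=["text", "text"],
    outputs="number",
    title="Letter Counter",
)

if __name__ == "__main__":
    # mcp_server=True exposes the function as an MCP tool
    # (using its type hints and docstring) alongside the web UI.
    demo.launch(mcp_server=True)
```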
Learning
Sleep-time Compute
A blog post introducing the concept of "sleep-time compute," which lets stateful AI agents process information and deepen their understanding during downtime by modifying their memory state.
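To make the idea concrete, here is a hypothetical sketch of the pattern: an agent that answers cheaply at test time because it spent idle cycles distilling raw memory into inference-rich context. Class and method names are invented for illustration and are not Letta's implementation:

```python
class SleepTimeAgent:
    """Illustrative only: an agent that refines its memory while idle."""

    def __init__(self, llm):
        self.llm = llm             # any callable: prompt -> text
        self.memory = []           # raw conversation history
        self.learned_context = ""  # distilled during "sleep"

    def answer(self, question: str) -> str:
        # Test time: answer against the pre-computed context,
        # so fewer tokens of reasoning are needed per query.
        self.memory.append(question)
        prompt = f"Context:\n{self.learned_context}\n\nQ: {question}\nA:"
        return self.llm(prompt)

    def sleep(self):
        # Sleep time: spend compute re-organizing memory so that
        # later answers are faster and better grounded.
        notes = "\n".join(self.memory)
        self.learned_context = self.llm(
            "Rewrite these notes into concise, inference-rich context:\n"
            + notes
        )
```

The key property is statefulness: `sleep()` runs between conversations, so its cost is amortized across every future query.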
Vector Search in the Real World: How to Filter Efficiently Without Killing Recall
A practical blog post discussing challenges and optimizations for efficient metadata filtering in production vector search while maintaining high recall.
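The core trade-off the post covers can be shown with a brute-force toy: post-filtering searches first and filters afterward (fast, but recall suffers under selective filters), while pre-filtering restricts candidates up front (recall-preserving, but it needs index support to stay fast at scale). A minimal NumPy sketch:

```python
import numpy as np

def post_filter_search(query, vectors, metadata, allowed, k=10, overfetch=4):
    """Post-filtering: search first, filter after. Cheap, but with a
    selective filter few top hits survive and recall drops."""
    scores = vectors @ query
    top = np.argsort(-scores)[: k * overfetch]  # over-fetch to compensate
    hits = [i for i in top if metadata[i] in allowed]
    return hits[:k]

def pre_filter_search(query, vectors, metadata, allowed, k=10):
    """Pre-filtering: restrict the candidate set first, then search.
    Preserves recall, but a real index must support this natively."""
    candidates = np.array([i for i, m in enumerate(metadata) if m in allowed])
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]].tolist()
```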
Multi-Modal Retrieval using VoyageAI Multi-Modal Embeddings
A notebook that demonstrates multi-modal retrieval using VoyageAI Multi-Modal Embeddings.
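A condensed version of what such a notebook does, based on VoyageAI's documented multimodal_embed interface (the model name and file paths below are placeholders; check the current docs before relying on the exact signature):

```python
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# Each input is a list interleaving text and PIL images.
documents = [
    ["A diagram of a mixture-of-experts layer.", Image.open("moe.png")],
    ["Release notes for Mistral Medium 3."],
]

doc_emb = vo.multimodal_embed(
    inputs=documents, model="voyage-multimodal-3", input_type="document"
).embeddings

query_emb = vo.multimodal_embed(
    inputs=[["How does expert routing work?"]],
    model="voyage-multimodal-3",
    input_type="query",
).embeddings[0]

# Dot product as similarity (Voyage embeddings are unit-normalized).
scores = [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_emb]
print("best match:", max(range(len(scores)), key=scores.__getitem__))
```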
MetaShuffling: Accelerating Llama 4 MoE Inference
A technical blog post detailing MetaShuffling, a method to accelerate Llama 4 Mixture-of-Experts model inference on PyTorch by optimizing kernel and runtime designs.
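The central idea is to sort tokens by their routed expert so that each expert processes one contiguous slice, avoiding both padding and per-expert Python loops. Below is a plain PyTorch sketch of that dispatch/combine step; MetaShuffling's actual contribution is doing this with fused kernels and without host-device synchronization:

```python
import torch

def shuffle_moe_dispatch(tokens, expert_ids, num_experts):
    """Group tokens by their routed expert into contiguous slices,
    so each expert runs one dense GEMM with no padding."""
    order = torch.argsort(expert_ids)    # shuffle tokens by expert
    shuffled = tokens[order]
    counts = torch.bincount(expert_ids, minlength=num_experts)
    # Expert e owns shuffled[offsets[e-1]:offsets[e]] (0 for e == 0).
    offsets = torch.cumsum(counts, dim=0)
    return shuffled, order, offsets

def shuffle_moe_combine(expert_out, order):
    """Invert the shuffle so outputs line up with the original tokens."""
    out = torch.empty_like(expert_out)
    out[order] = expert_out
    return out

# Toy usage: 8 tokens, hidden size 16, top-1 routing across 4 experts.
tokens = torch.randn(8, 16)
expert_ids = torch.randint(0, 4, (8,))
shuffled, order, offsets = shuffle_moe_dispatch(tokens, expert_ids, 4)
```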
An article about how BoldVoice uses machine learning and latent spaces to quantify and coach English accent strength.
Libraries & Code
A deep research framework fully powered by large reasoning models (LRMs).
Voila is a new family of large voice-language foundation models aiming to lift human-AI interaction experiences to the next level.
Papers & Publications
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
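To illustrate the "code executor as a unified verifiable reward" idea, here is a toy sketch: execute a proposed program to establish ground truth, then reward the solver only when its answer matches. Function names are hypothetical and this is not the paper's implementation (a real system would sandbox execution properly):

```python
import multiprocessing

def _run(src, inp, q):
    env = {}
    exec(src, env)  # toy setting only; sandbox in practice
    q.put(env["f"](inp))

def execute(src, inp, timeout=2):
    """Run a proposed program `f` on `inp`; None on error or timeout."""
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=_run, args=(src, inp, q))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        return None
    return q.get() if not q.empty() else None

def verifiable_reward(program_src, task_input, solver_answer):
    """Reward 1.0 iff the solver's answer matches what the executor
    says the proposed program actually produces."""
    truth = execute(program_src, task_input)
    if truth is None:  # an invalid task proposal earns no reward
        return 0.0
    return float(solver_answer == truth)

if __name__ == "__main__":
    print(verifiable_reward("def f(x): return x * 2", 21, 42))  # -> 1.0
```

The same executor grounds both roles: it filters out ill-posed tasks at proposal time and scores answers at solve time.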
Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models
Abstract:
We present Attentive Reasoning Queries (ARQs), a novel structured reasoning approach that significantly improves instruction-following in Large Language Models through domain-specialized reasoning blueprints. While LLMs demonstrate remarkable capabilities across diverse tasks, they often fail to maintain adherence to complex, use-case-specific instructions during multi-turn conversations, presenting challenges for business-critical applications. ARQs address this limitation by guiding LLMs through systematic reasoning steps with targeted queries that reinstate critical instructions and facilitate intermediate reasoning throughout the completion process. In extensive testing within Parlant, our framework for reliable customer-facing agents in which ARQs were born out of necessity, they achieved a 90.2% success rate across 87 test scenarios, outperforming both Chain-of-Thought reasoning (86.1%) and direct response generation (81.5%). ARQs showed particular strength in addressing persistent failure modes like guideline re-application and hallucination prevention. Our analysis also revealed that ARQs can potentially be more computationally efficient than free-form reasoning when carefully designed. These findings demonstrate that structured reasoning approaches provide effective mechanisms for controlling how LLMs process information and make decisions in complex scenarios.
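A rough sketch of what an ARQ-style blueprint could look like in practice: a fixed set of targeted queries the model must answer as structured output before committing to a response. The schema below is invented for illustration and is not Parlant's actual format:

```python
import json

# Hypothetical ARQ-style blueprint: targeted queries that reinstate
# the critical instructions before the model commits to a response.
ARQ_BLUEPRINT = {
    "active_guideline": "Which customer-support guideline applies here?",
    "guideline_already_applied": "Was it already applied earlier in the chat?",
    "supporting_facts": "Which facts from the conversation back my answer?",
    "final_response": "Given the above, what should I say to the customer?",
}

def build_arq_prompt(conversation: str) -> str:
    queries = json.dumps(ARQ_BLUEPRINT, indent=2)
    return (
        f"Conversation so far:\n{conversation}\n\n"
        "Before responding, answer each reasoning query below, then reply\n"
        "as a JSON object with the same keys:\n" + queries
    )

def run_arq(llm, conversation: str) -> str:
    raw = llm(build_arq_prompt(conversation))
    answers = json.loads(raw)  # assumes the model complied with the format
    return answers["final_response"]
```

Because the intermediate answers are structured, each reasoning step is inspectable, which is what makes failure modes like skipped guidelines easy to catch.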
Rethinking Reflection in Pre-Training
Abstract:
A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier: during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.
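The probing setup is easy to picture with a toy example: plant a deliberate arithmetic error mid-chain and credit the model only if its continuation still reaches the correct answer. The prompt and names below are illustrative, not the authors' evaluation code:

```python
def make_adversarial_cot() -> tuple[str, str]:
    """Build a chain-of-thought with one deliberately wrong step."""
    cot = (
        "Q: A box has 12 apples; 5 are eaten and 9 more are added. "
        "How many apples are in the box?\n"
        "Step 1: 12 - 5 = 8\n"  # deliberate error (should be 7)
        "Step 2: "
    )
    return cot, "16"  # correct final answer despite the injected error

def shows_reflection(llm) -> bool:
    prompt, correct = make_adversarial_cot()
    continuation = llm(prompt)       # model continues from the bad step
    return correct in continuation   # credit only if it self-corrects
```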