Deep Learning Weekly: Issue 414
Gemini with Deep Think achieves gold-medal standard at the International Mathematical Olympiad, a paper on Chain of Thought Monitorability, and many more!
This week in deep learning, we bring you Gemini with Deep Think achieves gold-medal standard at the International Mathematical Olympiad, and a paper on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
You may also enjoy Mistral AI's contribution to a global environmental standard for AI, Inference Economics of Language Models, a paper on No time to train! Training-Free Reference-Based Instance Segmentation, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemini with Deep Think officially achieved gold-medal standard at the International Mathematical Olympiad (IMO) by solving five out of the six IMO problems.
Our contribution to a global environmental standard for AI
The Mistral team conducted a first-of-its-kind comprehensive study to quantify the environmental impact of their LLMs.
Lovable raises $200M at $1.8B valuation
Lovable announced that it has raised $200 million in an early-stage round that values the company at $1.8 billion.
Learning
Inference Economics of Language Models
An analytical article investigating how speed trades off against cost in language model inference, and what this trade-off implies for scaling LLM serving.
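The core tension the article explores, serving each user faster versus serving each token cheaper, can be illustrated with a toy roofline model of the decode phase. The sketch below is not the article's analysis: every constant (GPU price, bandwidth, FLOP/s, parameter count) is an assumed placeholder, and it ignores KV-cache traffic and communication overheads.

```python
# Toy roofline-style model of the speed/cost trade-off in LLM decoding.
# Illustrative only; all constants are made-up assumptions.

GPU_COST_PER_HOUR = 2.0      # assumed $/GPU-hour
MEM_BANDWIDTH = 2.0e12       # assumed HBM bandwidth, bytes/s
PEAK_FLOPS = 1.0e15          # assumed FLOP/s at serving precision
PARAMS = 70e9                # assumed parameter count
BYTES_PER_PARAM = 2          # fp16/bf16 weights

def decode_step_time(batch_size: int) -> float:
    """Seconds per decoding step: limited by weight reads or by compute."""
    memory_time = PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH
    compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS  # ~2 FLOPs/param/token
    return max(memory_time, compute_time)

def speed_and_cost(batch_size: int) -> tuple[float, float]:
    """Per-request tokens/s and $ per 1M tokens at a given batch size."""
    step = decode_step_time(batch_size)
    per_request_speed = 1.0 / step
    aggregate_throughput = batch_size / step
    cost_per_token = (GPU_COST_PER_HOUR / 3600.0) / aggregate_throughput
    return per_request_speed, cost_per_token * 1e6

for b in (1, 8, 64, 512):
    speed, cost = speed_and_cost(b)
    print(f"batch={b:4d}  speed={speed:7.1f} tok/s per request  cost=${cost:8.3f}/M tok")
```

Even this crude model shows the shape of the trade-off: at small batches the GPU is memory-bound, so cost per token is high while per-request speed barely changes; pushing the batch up slashes cost until compute becomes the bottleneck, at which point further batching starts slowing every request down.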
The Big LLM Architecture Comparison
An article offering a comprehensive comparison of modern LLM architectural designs from DeepSeek-V3 to Kimi K2, highlighting key evolutions and distinctions in their structures.
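One of the recurring distinctions in that comparison is sparse Mixture-of-Experts feed-forward layers, which models such as DeepSeek-V3 and Kimi K2 use in place of dense FFNs. The block below is a minimal sketch of the idea, not any particular model's code; the dimensions, expert count, and top-k routing are arbitrary choices.

```python
# Minimal sketch of a sparse Mixture-of-Experts (MoE) feed-forward block.
# Illustrative only; hyperparameters are arbitrary, routing is unoptimized.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # send each token to its k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(16, 512)
print(MoEFeedForward()(x).shape)   # torch.Size([16, 512])
```

The payoff is that only a few experts run per token, so parameter count grows much faster than per-token compute, which is exactly the lever the newer architectures in the article pull.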
Libraries & Code
Nerfstudio provides a simple API for a streamlined end-to-end process of creating, training, and testing NeRFs.
LLM Frontend for Power Users.
Papers & Publications
No time to train! Training-Free Reference-Based Instance Segmentation
Abstract:
The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic segmentation paradigm and yet still requires manual visual prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
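To make the key insight concrete, here is a hedged sketch of reference-to-target feature matching: aggregate a reference object's foundation-model features into a prototype, score target patches by cosine similarity, and use the best matches as candidate point prompts for a promptable segmenter such as SAM. This is not the paper's implementation; `encode_patch_features` is a hypothetical stand-in for a frozen vision encoder, and the random tensors are placeholders.

```python
# Hedged sketch of semantic-aware feature matching between a reference and a
# target image. NOT the paper's code; encoder and tensors are placeholders.
import torch
import torch.nn.functional as F

def encode_patch_features(image: torch.Tensor) -> torch.Tensor:
    """Hypothetical frozen encoder: (3, H, W) image -> (h, w, d) patch features."""
    h, w, d = 32, 32, 256                       # placeholder grid and feature dim
    return torch.randn(h, w, d)

def match_reference_to_target(ref_img, ref_mask, tgt_img, n_points=5):
    """Return (row, col) patch coordinates in the target that best match the
    reference object; these could seed point prompts for a promptable segmenter."""
    ref_feats = encode_patch_features(ref_img)            # (h, w, d)
    tgt_feats = encode_patch_features(tgt_img)            # (h, w, d)

    # Aggregate the reference object's representation over its masked patches.
    mask = F.interpolate(ref_mask[None, None].float(),
                         size=ref_feats.shape[:2], mode="nearest")[0, 0].bool()
    proto = F.normalize(ref_feats[mask].mean(0), dim=-1)  # (d,)

    # Semantic-aware matching: cosine similarity of target patches vs. prototype.
    sim = F.normalize(tgt_feats, dim=-1) @ proto          # (h, w)
    flat = sim.flatten().topk(n_points).indices
    return [(int(i // sim.shape[1]), int(i % sim.shape[1])) for i in flat]

ref_img, tgt_img = torch.randn(3, 512, 512), torch.randn(3, 512, 512)
ref_mask = torch.zeros(512, 512); ref_mask[100:300, 150:350] = 1
print(match_reference_to_target(ref_img, ref_mask, tgt_img))
```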
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Abstract:
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
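The monitoring loop the authors argue for can be sketched in a few lines: inspect a model's chain of thought for signs of intent to misbehave before its proposed action runs. The snippet below is purely illustrative and not the paper's method; the keyword heuristic stands in for whatever monitor (for example, another LLM acting as a judge) a developer would actually deploy, and the flag list is a made-up placeholder.

```python
# Illustrative CoT-monitoring gate: block an agent's action for human review
# if its reasoning trace matches a suspicious pattern. Placeholder heuristic only.
from dataclasses import dataclass

SUSPICIOUS_PATTERNS = (            # made-up triggers for the toy monitor
    "without telling the user",
    "hide this from",
    "disable the logging",
)

@dataclass
class MonitorResult:
    allowed: bool
    reason: str

def monitor_chain_of_thought(cot: str) -> MonitorResult:
    """Flag a CoT trace for review if it matches any suspicious pattern."""
    lowered = cot.lower()
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern in lowered:
            return MonitorResult(False, f"matched pattern: {pattern!r}")
    return MonitorResult(True, "no suspicious reasoning detected")

def guarded_execute(cot: str, action):
    """Run the agent's proposed action only if its reasoning passes the monitor."""
    verdict = monitor_chain_of_thought(cot)
    if not verdict.allowed:
        raise PermissionError(f"action blocked for review: {verdict.reason}")
    return action()

print(monitor_chain_of_thought("I should back up the file, then notify the user."))
```

The paper's point about fragility is visible even here: the gate only works as long as the model's chain of thought remains legible and faithful, which is exactly what training decisions could erode.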
ThinkSound
Abstract:
While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with the work of professionals in the creative industries, such generation requires sophisticated reasoning about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels on the out-of-distribution Movie Gen Audio benchmark.
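As a structural sketch of the three-stage pipeline the abstract describes (foley generation, object-centric refinement, instruction-guided editing, each conditioned on CoT from a multimodal LLM), the outline below shows only the control flow. Every function is a hypothetical stub standing in for the real models; this is not ThinkSound's implementation.

```python
# Outline of a CoT-conditioned, three-stage video-to-audio pipeline.
# All functions are hypothetical stubs; only the staging/control flow is shown.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioTrack:
    samples: list          # placeholder for waveform data
    history: list          # CoT steps that produced or edited this audio

def mllm_reason(video, context: str) -> str:
    """Stand-in for the multimodal LLM producing CoT for the current stage."""
    return f"CoT({context})"

def audio_foundation_model(video, cot: str, base: Optional[AudioTrack] = None) -> AudioTrack:
    """Stand-in for a unified audio model conditioned on video + CoT."""
    prev = base.history if base else []
    return AudioTrack(samples=[], history=prev + [cot])

def generate_soundtrack(video, clicks=(), instructions=()) -> AudioTrack:
    # Stage 1: foundational foley for the whole scene.
    audio = audio_foundation_model(video, mllm_reason(video, "scene-level foley"))
    # Stage 2: interactive, object-centric refinement from user selections.
    for obj in clicks:
        audio = audio_foundation_model(video, mllm_reason(video, f"refine {obj}"), audio)
    # Stage 3: targeted edits from natural-language instructions.
    for instr in instructions:
        audio = audio_foundation_model(video, mllm_reason(video, f"edit: {instr}"), audio)
    return audio

track = generate_soundtrack("video.mp4", clicks=["dog"], instructions=["make the rain softer"])
print(track.history)
```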