Deep Learning Weekly: Issue 444
Gemini 3.1 Pro, A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026, a paper on Does Your Reasoning Model Implicitly Know When to Stop Thinking?, and many more!
This week in deep learning, we bring you Gemini 3.1 Pro, A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026, and a paper on Does Your Reasoning Model Implicitly Know When to Stop Thinking?.
You may also enjoy Anthropic’s Remote Control, How we caught our AI agent embezzling tokens, a paper on On Data Engineering for Scaling LLM Terminal Capabilities, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemini 3.1 Pro: A smarter model for your most complex tasks
Google launches Gemini 3.1 Pro, claiming more than double the reasoning performance of its predecessor on complex logic benchmarks, now rolling out across developer, enterprise, and consumer products.
Anthropic just released a mobile version of Claude Code called Remote Control
Anthropic launches Claude Code Remote Control, a new feature enabling developers to initiate coding sessions on their local terminal and seamlessly continue them from any mobile device or browser without moving code to the cloud.
Cohere Labs Launches Tiny Aya, Making Multilingual AI Accessible
Cohere Labs releases Tiny Aya, a 3.35B open-weight model claiming top multilingual performance in its size class across region-specific language variants.
Cursor agents can now control their own computers
Cursor launches cloud agents that run in isolated VMs with full computer-use capabilities, producing merge-ready PRs with video/screenshot artifacts to validate their work across web, mobile, Slack, and GitHub.
Visual imitation learning: Guidde trains AI agents on human ‘expert video’ instead of documentation
Guidde raises $50M to train AI agents on expert screen-recording videos instead of static documentation, cutting video creation time by 41% and support tickets by 34%.
The public opposition to AI infrastructure is heating up
Bipartisan opposition to AI data centers is escalating across the U.S., with states like New York proposing three-year construction moratoriums and communities pulling tax incentives, even as Big Tech commits $650B in infrastructure spending.
MLOps/LLMOps
How we caught our AI agent embezzling tokens
A PostHog engineering deep-dive into how they traced, diagnosed, and reduced their AI Wizard agent’s $6.67/run inference cost — uncovering three “token embezzlement” patterns and counterintuitive findings about context management and caching.
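The kind of per-step cost attribution a deep-dive like this depends on can be sketched in a few lines. The step names and per-token prices below are hypothetical stand-ins, not PostHog's actual agent steps or pricing:

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per token (illustrative only).
PRICES = {"input": 3.00 / 1e6, "output": 15.00 / 1e6}

@dataclass
class Step:
    """One traced agent step with its token usage."""
    name: str
    prompt_tokens: int
    completion_tokens: int

    def cost(self) -> float:
        return (self.prompt_tokens * PRICES["input"]
                + self.completion_tokens * PRICES["output"])

def run_cost(steps: list[Step]) -> float:
    """Total inference cost for one agent run."""
    return sum(s.cost() for s in steps)

def dominant_step(steps: list[Step]) -> Step:
    """The step responsible for the largest share of spend."""
    return max(steps, key=lambda s: s.cost())

# Invented trace: a bloated context in the "edit" step dominates the bill.
steps = [
    Step("plan", 20_000, 500),
    Step("edit", 150_000, 2_000),
    Step("verify", 30_000, 300),
]
print(f"run cost: ${run_cost(steps):.3f}, worst step: {dominant_step(steps).name}")
```

Attributing cost per step rather than per run is what makes patterns like repeated context re-sends visible in the first place.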
Learning
A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026
A comprehensive architectural deep-dive comparing 10 major open-weight LLM releases from January–February 2026, highlighting the convergence toward hybrid attention mechanisms and efficiency-first design across models ranging from 3B to 1T parameters.
MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix
An engineering blog post about how Netflix built MediaFM, its first in-house tri-modal (audio, video, text) foundation model trained on tens of millions of catalog shots to power recommendations, ad relevancy, and promotional asset optimization at scale.
Detecting and preventing distillation attacks | Anthropic
Anthropic exposes three Chinese AI labs — DeepSeek, Moonshot, and MiniMax — for running industrial-scale “distillation attacks” that illicitly extracted Claude’s capabilities across 16M+ exchanges through ~24,000 fraudulent accounts.
Expanding our analysis of biological AI models | Epoch AI
A comprehensive Epoch AI report cataloging 1,196 biological AI models across nine categories, revealing critical biosafety gaps and landscape trends commissioned by Sentinel Bio.
Google Research introduces MapTrace, a fully automated synthetic data pipeline using Gemini and Imagen models to generate 2M annotated map path examples — teaching multimodal LLMs fine-grained spatial reasoning and reducing path-tracing error by 33% on real-world benchmarks.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Fully autonomous AI Agents system capable of performing complex penetration testing tasks
Papers & Publications
Does Your Reasoning Model Implicitly Know When to Stop Thinking?
Abstract:
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.
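The core claim, that the model's own distribution carries a stop signal which standard sampling ignores, can be illustrated with a toy simulation. The `toy_next_token_dist` function and the threshold value are invented for illustration; this is not the paper's actual SAGE procedure:

```python
import random

END = "</think>"  # token that ends the reasoning chain

def toy_next_token_dist(step: int) -> dict[str, float]:
    # Hypothetical model: the implicit "stop thinking" probability
    # grows steadily as the reasoning chain lengthens.
    p_end = min(0.9, 0.02 * step)
    return {END: p_end, "more-reasoning": 1.0 - p_end}

def sample_standard(max_steps: int = 100, seed: int = 0) -> int:
    """Standard sampling: stop only when the end token happens to be drawn."""
    rng = random.Random(seed)
    for step in range(1, max_steps + 1):
        if rng.random() < toy_next_token_dist(step)[END]:
            return step
    return max_steps

def sample_early_stop(threshold: float = 0.5, max_steps: int = 100) -> int:
    """Stop-aware sampling: halt once the model's implicit stop signal
    crosses a threshold, instead of waiting for a lucky draw."""
    for step in range(1, max_steps + 1):
        if toy_next_token_dist(step)[END] >= threshold:
            return step
    return max_steps

print("early-stop halts at step:", sample_early_stop(0.5))
```

In this toy setup the stop-aware rule halts deterministically once the signal is strong, while standard sampling can overshoot by many steps, which is the inefficiency the abstract attributes to current sampling paradigms.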
On Data Engineering for Scaling LLM Terminal Capabilities
Abstract:
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models.
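The seed-based construction plus filtering and curriculum ordering the abstract describes can be sketched as a minimal pipeline. The seed tasks, skill names, and difficulty scores below are invented for illustration and are not from Terminal-Task-Gen:

```python
import itertools

# Hypothetical seed tasks and (skill, difficulty) pairs.
SEEDS = ["compress a directory", "find files over 1GB"]
SKILLS = [("tar", 1), ("awk", 2), ("xargs", 3)]

def generate(seeds: list[str], skills: list[tuple[str, int]]) -> list[dict]:
    """Seed-based construction: cross each seed task with each skill."""
    return [
        {"prompt": f"{seed} with {name}", "difficulty": diff}
        for seed, (name, diff) in itertools.product(seeds, skills)
    ]

def curriculum(tasks: list[dict], max_difficulty: int) -> list[dict]:
    """Filtering plus curriculum: drop over-hard tasks, order easy to hard."""
    kept = [t for t in tasks if t["difficulty"] <= max_difficulty]
    return sorted(kept, key=lambda t: t["difficulty"])

tasks = curriculum(generate(SEEDS, SKILLS), max_difficulty=2)
for t in tasks:
    print(t["difficulty"], t["prompt"])
```

Real pipelines of this kind add verification of each synthetic task in a sandbox before it enters the training corpus; the sketch only shows the combinatorial construction and ordering.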