<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Deep Learning Weekly]]></title><description><![CDATA[Bringing you everything new and exciting in the world of  deep learning from academia to the grubby depths  of industry every week right to your inbox.]]></description><link>https://www.deeplearningweekly.com</link><image><url>https://substackcdn.com/image/fetch/$s_!yiM2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63609b6-c5bb-426a-a5c1-b6ce9d56b51e_468x468.png</url><title>Deep Learning Weekly</title><link>https://www.deeplearningweekly.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 07:24:49 GMT</lastBuildDate><atom:link href="https://www.deeplearningweekly.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Deep Learning Weekly]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[deeplearningweekly@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[deeplearningweekly@substack.com]]></itunes:email><itunes:name><![CDATA[Deep Learning Weekly]]></itunes:name></itunes:owner><itunes:author><![CDATA[Deep Learning Weekly]]></itunes:author><googleplay:owner><![CDATA[deeplearningweekly@substack.com]]></googleplay:owner><googleplay:email><![CDATA[deeplearningweekly@substack.com]]></googleplay:email><googleplay:author><![CDATA[Deep Learning Weekly]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Deep Learning Weekly: Issue 452]]></title><description><![CDATA[Introducing Ollie: Auto-Fix Your Agent&#8217;s Codebase, Designing synthetic datasets for the real 
world: Mechanism design and reasoning from first principles, a paper on Adam's Law: Textual Frequency Law o]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-452</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-452</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 23 Apr 2026 15:01:16 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/self-improving-agents/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=self-improving-agents/">Introducing Ollie: Auto-Fix Your Agent&#8217;s Codebase</a>, <a href="https://research.google/blog/designing-synthetic-datasets-for-the-real-world-mechanism-design-and-reasoning-from-first-principles/">Designing synthetic datasets for the real world: Mechanism design and reasoning from first principles</a> and <a href="https://arxiv.org/abs/2604.02176">a paper on Adam&#8217;s Law: Textual Frequency Law on Large Language Models</a>.</p><p>You may also enjoy <a href="https://www.anthropic.com/news/claude-opus-4-7">Claude Opus 4.7</a>, <a href="https://zilliz.com/blog/notion-vector-search-next-problem">Notion Vector Search Architecture</a>, <a href="https://openreview.net/forum?id=7xjoTuaNmN">OpenThoughts: Data Recipes for Reasoning Models</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://openai.com/index/introducing-chatgpt-images-2-0/">Introducing ChatGPT Images 2.0</a></strong></p><p>OpenAI releases ChatGPT Images 2.0, its first image model with native reasoning and web search, generating up to 8 coherent images per prompt at up to 2K resolution.</p><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-7">Introducing Claude Opus 4.7 \ Anthropic</a></strong></p><p>Anthropic releases Claude Opus 4.7, a coding-focused upgrade over Opus 4.6 with significantly improved vision, a new xhigh effort level, and real-world cyber safeguards.</p><p><strong><a href="https://openai.com/index/introducing-openai-privacy-filter/">Introducing OpenAI Privacy Filter</a></strong></p><p>OpenAI releases Privacy Filter, a 1.5B-parameter open-source, on-device PII detection and redaction model derived from gpt-oss, scoring 96% F1 on PII-Masking-300k.</p><p><strong><a href="https://www.kimi.com/blog/kimi-k2-6">Kimi K2.6 Tech Blog: Advancing Open-Source Coding</a></strong></p><p>Moonshot AI open-sources Kimi K2.6, a coding and long-horizon agent model that scales agent swarms to 300 concurrent sub-agents across 4,000 coordinated steps, with benchmark results competitive with GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro and agentic tasks.</p><p><strong><a href="https://venturebeat.com/technology/googles-new-deep-research-and-deep-research-max-agents-can-search-the-web-and-your-private-data">Google&#8217;s new Deep Research and Deep Research Max agents can search the web and your private data</a></strong></p><p>Google launches two Gemini 3.1 Pro-powered autonomous research agents &#8212; Deep Research and Deep Research Max &#8212; that combine open web search with proprietary enterprise data via MCP in a single API 
call.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/self-improving-agents/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=self-improving-agents/">Introducing Ollie: Auto-Fix Your Agent&#8217;s Codebase</a></strong></p><p>Comet announces Ollie, a coding assistant embedded in the Opik platform that closes the observability-to-action loop by autonomously analyzing agent traces, diagnosing failures, patching code, and writing regression tests &#8212; all within a single workflow.</p><p><strong><a href="https://www.comet.com/site/blog/ai-agent-regression-testing/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-agent-regression-testing/">Introducing Opik Test Suites: Straightforward Unit &amp; Regression Testing for AI Agents</a></strong></p><p>Comet announces Opik Test Suites, a regression testing framework for AI agents that replaces dataset-based evaluation scores with software-style pass/fail assertions written in plain English.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/designing-synthetic-datasets-for-the-real-world-mechanism-design-and-reasoning-from-first-principles/">Designing synthetic datasets for the real world: Mechanism design and reasoning from first principles</a></strong></p><p>A Google Research blog post introducing Simula, a reasoning-first synthetic data framework that treats dataset generation as mechanism design &#8212; controlling diversity, complexity, and quality as independent axes.</p><p><strong><a href="https://opensearch.org/blog/benchmarking-multimodal-document-search-in-opensearch-three-approaches-compared/">Benchmarking multimodal document search in OpenSearch: Three approaches compared</a></strong></p><p>A technical benchmark comparing ColPali late-interaction reranking, BDA modality-aware embedding, and text-only chunking for multimodal document search in OpenSearch across 
quality, latency, and ingest performance on 1,000 report pages.</p><p><strong><a href="https://zilliz.com/blog/notion-vector-search-next-problem">Notion Vector Search Architecture: What Comes Next</a></strong></p><p>A blog post analyzing Notion&#8217;s two-year vector search evolution as a proxy for the harder infrastructure problems &#8212; offline context engineering, embedding model upgrades, and real-time/batch unification &#8212; that scaling multiple AI features will demand next.</p><p><strong><a href="https://weaviate.io/blog/engram-deep-dive">Engram: Memory by Weaviate</a></strong></p><p>Weaviate announces Engram, a managed memory service that uses async pipelines to extract, deduplicate, and maintain agent memories on top of Weaviate&#8217;s vector database.</p><p><strong><a href="https://alignment.anthropic.com/2026/automated-w2s-researcher/">Automated Weak-to-Strong Researcher</a></strong></p><p>Anthropic&#8217;s Claude-powered Automated Alignment Researcher achieves a 0.97 performance gap recovered score on weak-to-strong supervision in 5 days &#8212; versus 0.23 by human researchers in 7 days.</p><p><strong><a href="https://embracethered.com/blog/posts/2026/breaking-opus-4.7-with-chatgpt/">Breaking Opus 4.7 with ChatGPT (Hacking Claude&#8217;s Memory)</a></strong></p><p>A security research post demonstrating a ChatGPT-generated adversarial image that successfully hijacked Claude Opus 4.7&#8217;s memory tool via indirect prompt injection &#8212; succeeding 5 out of 10 attempts before Anthropic patched the specific exploit within 24 hours.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a 
href="https://github.com/google/skills/tree/main">google/skills</a></strong></p><p>Agent Skills for Google products and technologies.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2604.02176">Adam&#8217;s Law: Textual Frequency Law on Large Language Models</a></strong></p><p><strong>Abstract:</strong></p><p>While textual frequency has been validated as relevant to human cognition in reading speed, its relevance to Large Language Models (LLMs) is seldom studied. We propose a novel research direction centered on textual data frequency, a topic that, to the best of our knowledge, remains understudied. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs keep their training data closed-source, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.</p><p><strong><a href="https://openreview.net/forum?id=7xjoTuaNmN">OpenThoughts: Data Recipes for Reasoning Models</a></strong></p><p><strong>Abstract:</strong></p><p>Reasoning models have made rapid progress on many benchmarks involving math, code, and science.
Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond &#8211; improvements of 15.3, 17.2, and 20.5 percentage points compared to DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 451]]></title><description><![CDATA[The 2026 AI Index Report, MirrorCode: Evidence that AI can already do some weeks-long coding tasks, a paper on Introspective Diffusion Language Models, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-451</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-451</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 16 Apr 2026 15:03:11 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://hai.stanford.edu/ai-index/2026-ai-index-report">The 2026 AI Index Report</a>, <a href="https://epoch.ai/blog/mirrorcode-preliminary-results">MirrorCode: Evidence that AI can already do some weeks-long coding tasks</a> and <a href="https://arxiv.org/abs/2604.11035">a paper on Introspective Diffusion Language Models</a>.</p><p>You may also enjoy <a href="https://deepmind.google/blog/gemini-robotics-er-1-6/">Gemini Robotics ER 1.6: Enhanced Embodied Reasoning</a>, <a href="https://blog.ml.cmu.edu/2026/04/13/when-should-ai-step-aside-teaching-agents-when-humans-want-to-intervene/">Should AI Step Aside?: Teaching Agents When Humans Want to Intervene</a>, <a href="https://arxiv.org/abs/2604.11641">a paper on CodeTracer: Towards Traceable Agent 
States</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://hai.stanford.edu/ai-index/2026-ai-index-report">The 2026 AI Index Report | Stanford HAI</a></strong></p><p>Stanford HAI releases the 2026 AI Index &#8212; a 400+ page annual report tracking AI&#8217;s technical performance, investment, labor market effects, policy landscape, and public sentiment across nine chapters.</p><p><strong><a href="https://deepmind.google/blog/gemini-robotics-er-1-6/">Gemini Robotics ER 1.6: Enhanced Embodied Reasoning</a></strong></p><p>Google DeepMind releases Gemini Robotics-ER 1.6, a robotics-specialized reasoning model with upgraded spatial reasoning, multi-view success detection, and instrument reading (93% accuracy with agentic vision).</p><p><strong><a href="https://ai.meta.com/blog/introducing-muse-spark-msl/">Introducing Muse Spark: Scaling Towards Personal Superintelligence</a></strong></p><p>Meta&#8217;s Superintelligence Labs launches Muse Spark &#8212; a natively multimodal reasoning model with multi-agent &#8220;Contemplating&#8221; mode that achieves 58% on Humanity&#8217;s Last Exam.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/">Gemini 3.1 Flash TTS: the next generation of expressive AI speech</a></strong></p><p>Google launches Gemini 3.1 Flash TTS, a text-to-speech model with natural-language audio tags for granular vocal control across 70+ languages, available via Gemini API, Vertex AI, and Google Vids.</p><p><strong><a href="https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/introducing-mai-image-2-efficient-faster-more-efficient-image-generation/4510918">Introducing MAI-Image-2-Efficient: Faster, More Efficient Image 
Generation</a></strong></p><p>Microsoft releases MAI-Image-2-Efficient &#8212; 22% faster and 4x more GPU-efficient than MAI-Image-2, targeting high-volume and real-time image generation workloads.</p><p><strong><a href="https://claude.com/blog/introducing-routines-in-claude-code">Introducing routines in Claude Code</a></strong></p><p>Anthropic launches Routines in Claude Code &#8212; serverless automations triggered by schedule, API call, or GitHub webhook events, with daily limits of 5&#8211;25 runs depending on plan tier.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/multimodal-llm-evaluation/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=multimodal-llm-evaluation//">Multimodal LLM Evaluation: A Developer&#8217;s Guide to Multimodal Language Models</a></strong></p><p>A guide to evaluating multimodal LLMs, highlighting why text-only metrics fall short for image, audio, and video inputs, while outlining methods for grounding outputs and using LLM-based evaluation to measure real-world performance.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://blog.ml.cmu.edu/2026/04/13/when-should-ai-step-aside-teaching-agents-when-humans-want-to-intervene/">Should AI Step Aside?: Teaching Agents When Humans Want to Intervene</a></strong></p><p>A research blog post introducing CowCorpus and PlowPilot &#8212; a dataset and intervention-aware web agent system that predicts when users want to take over, yielding a 26.5% improvement in user-rated usefulness over a fully autonomous baseline.</p><p><strong><a href="https://epoch.ai/blog/mirrorcode-preliminary-results">MirrorCode: Evidence that AI can already do some weeks-long coding tasks</a></strong></p><p>A research report from Epoch AI introducing MirrorCode, a long-horizon coding benchmark, showing Claude Opus 4.6 can autonomously reimplement a 16,000-line bioinformatics toolkit estimated to take a human engineer 2&#8211;17 
weeks.</p><p><strong><a href="https://www.philschmid.de/agent-skills-tips">8 Tips for Writing Agent Skills</a></strong></p><p>A practical guide on authoring effective agent skills, covering description precision, instruction conciseness, layered context loading, and when to retire skills as model capabilities advance.</p><p><strong><a href="https://www.quantamagazine.org/the-ai-revolution-in-math-has-arrived-20260413/">The AI Revolution in Math Has Arrived</a></strong></p><p>A Quanta Magazine feature documenting how AI has become a genuine research accelerator, with mathematicians using it to discover and prove new results in days rather than months.</p><p><strong><a href="https://unsloth.ai/docs/models/gemma-4/train">Gemma 4 Fine-tuning Guide</a></strong></p><p>Unsloth&#8217;s technical guide for fine-tuning Google&#8217;s Gemma 4 family covering VRAM requirements, critical bug fixes for KV-sharing and gradient accumulation, and recipes for SFT, vision, audio, and GRPO training.</p><p><strong><a href="https://research.google/blog/towards-developing-future-ready-skills-with-generative-ai/">Towards developing future-ready skills with generative AI</a></strong></p><p>A Google blog post introducing Vantage, a GenAI-powered assessment platform that places students in AI-simulated multi-party conversations to measure &#8220;future-ready&#8221; skills.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f">LLM Wiki</a></strong></p><p>A pattern for building personal knowledge bases using LLMs.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a 
href="https://arxiv.org/abs/2604.11035">Introspective Diffusion Language Models</a></strong></p><p><strong>Abstract:</strong></p><p>Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.</p><p><strong><a href="https://arxiv.org/abs/2604.11641">CodeTracer: Towards Traceable Agent States</a></strong></p><p><strong>Abstract:</strong></p><p>Code agents are advancing rapidly, but debugging them is becoming increasingly difficult.
As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent&#8217;s state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 450]]></title><description><![CDATA[Gemma 4, Components of A Coding Agent, a paper on VOID: Video Object and Interaction Deletion, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-450</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-450</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 09 Apr 2026 15:01:40 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Gemma 4</a>, <a href="https://magazine.sebastianraschka.com/p/components-of-a-coding-agent">Components of A Coding Agent</a>, and <a href="https://arxiv.org/abs/2604.02296">a paper on VOID: Video Object and Interaction Deletion</a>.</p><p>You may also enjoy <a href="https://claude.com/blog/claude-managed-agents">Claude Managed Agents</a>, <a href="https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/">Evaluating alignment of behavioral dispositions in LLMs</a>, <a href="https://arxiv.org/abs/2604.04921">a paper on TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://qwen.ai/blog?id=qwen3.6">Qwen: Qwen3.6-Plus: Towards Real World Agents</a></strong></p><p>Alibaba launches Qwen3.6-Plus, a frontier agentic coding model that matches or beats Claude Opus 4.5 on SWE-bench and Terminal-Bench 2.0.</p><p><strong><a href="https://claude.com/blog/claude-managed-agents">Claude Managed Agents: get to production 10x faster</a></strong></p><p>Anthropic launches Claude Managed Agents in public beta &#8212; a suite of composable, cloud-hosted agent APIs that abstract away sandboxing, state management, permissioning, and orchestration, enabling teams to ship production agents in days instead of months.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Gemma 4: Our most capable open models to date</a></strong></p><p>Google releases Gemma 4, a family of four open models &#8212; with the 31B ranking #3 among open models on Arena AI and outcompeting models 20x its size.</p><p><strong><a href="https://siliconangle.com/2026/04/07/modus-secures-85m-expand-ai-powered-audit-accounting-partnerships/">Modus secures $85M to expand AI-powered audit and accounting partnerships</a></strong></p><p>Modus Audit raises $85M to deploy AI across audit and accounting firm workflows.</p><p><strong><a href="https://ollama.com/blog/mlx">Ollama is now powered by MLX on Apple Silicon in preview</a></strong></p><p>Ollama 0.19 launches MLX-powered inference on Apple Silicon, delivering ~2x gains in prefill and decode speed on M5 chips, with NVFP4 quantization support and smarter KV cache reuse for agentic workloads.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a 
href="https://www.comet.com/site/blog/ai-agent-evaluation/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-agent-evaluation//">AI Agent Evaluation: Building Reliable Systems Beyond Simple Testing</a></strong></p><p>A practical guide on why standard LLM evaluation breaks for agentic systems, covering compounding failures, process vs. outcome metrics, multi-turn state tracking, and the trace-evaluate-optimize loop needed for production agents.</p><p><strong><a href="https://aws.amazon.com/blogs/machine-learning/simulate-realistic-users-to-evaluate-multi-turn-ai-agents-in-strands-evals/">Simulate realistic users to evaluate multi-turn AI agents in Strands Evals</a></strong></p><p>A technical blog about ActorSimulator in AWS&#8217;s Strands Evals SDK, which generates persona-consistent, goal-driven simulated users to automate multi-turn agent evaluation at scale.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://magazine.sebastianraschka.com/p/components-of-a-coding-agent">Components of A Coding Agent</a></strong></p><p>A breakdown by Sebastian Raschka of the six architectural components that make coding agents (Claude Code, Codex CLI) meaningfully more capable than raw LLMs in a chat UI.</p><p><strong><a href="https://ngrok.com/blog/quantization">Quantization from the ground up</a></strong></p><p>A highly interactive, ground-up explainer on LLM quantization covering floating point formats, symmetric vs.
asymmetric compression, outlier handling, and empirical quality/speed tradeoffs.</p><p><strong><a href="https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/">Evaluating alignment of behavioral dispositions in LLMs</a></strong></p><p>A blog post on evaluating behavioral alignment across 25 LLMs, finding frontier models hit ~80&#8211;83% alignment with human consensus but are systematically overconfident in ambiguous scenarios and inconsistent between self-reported and revealed behavior.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/NousResearch/hermes-agent">NousResearch/hermes-agent</a></strong></p><p>The self-improving AI agent built by Nous Research. It&#8217;s the only agent with a built-in learning loop.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2604.02296">VOID: Video Object and Interaction Deletion</a></strong></p><p><strong>Abstract:</strong></p><p>Existing video object removal methods excel at inpainting content &#8220;behind&#8221; the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. 
During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.</p><p><strong><a href="https://arxiv.org/abs/2604.04921">TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</a></strong></p><p><strong>Abstract:</strong></p><p>Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. 
On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 449]]></title><description><![CDATA[Gemini 3.1 Flash Live, Cohere Transcribe: state-of-the-art speech recognition, a paper on IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-449</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-449</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 02 Apr 2026 15:03:12 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a 
href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-live/">Gemini 3.1 Flash Live</a>, <a href="https://cohere.com/blog/transcribe">Cohere Transcribe: state-of-the-art speech recognition</a>, and <a href="https://arxiv.org/abs/2603.12201">a paper on IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse</a>.</p><p>You may also enjoy <a href="https://mistral.ai/news/voxtral-tts">Mistral AI&#8217;s Voxtral</a>, <a href="https://www.philschmid.de/kimi-composer-context">How Kimi, Cursor, and Chroma Train Agentic Models with RL</a>, <a href="https://arxiv.org/abs/2603.29620">a paper on Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-live/">Gemini 3.1 Flash Live: Making audio AI more natural and reliable</a></strong></p><p>Google launches Gemini 3.1 Flash Live, its highest-quality real-time audio model, scoring 90.8% on ComplexFuncBench Audio and 36.1% on AudioMultiChallenge.</p><p><strong><a href="https://mistral.ai/news/voxtral-tts">Speaking of Voxtral</a></strong></p><p>Mistral launches Voxtral TTS, a 4B-parameter multilingual text-to-speech model supporting 9 languages with 70ms latency, voice cloning from 3-second samples, and more.</p><p><strong><a href="https://ai.meta.com/blog/segment-anything-model-3/">SAM 3.1: Faster and More Accessible Real-Time Video Detection and Tracking With Multiplexing and Global Reasoning</a></strong></p><p>Meta updates SAM 3 to SAM 3.1, adding object multiplexing to double video processing speed to 32 FPS on a single H100 for its open-source text-prompted segmentation and 
tracking model.</p><p><strong><a href="https://cohere.com/blog/transcribe">Cohere Transcribe: state-of-the-art speech recognition</a></strong></p><p>Cohere launches Transcribe, a 2B-parameter open-source ASR model that tops the HuggingFace Open ASR Leaderboard with a 5.42% average word error rate across 14 languages.</p><p><strong><a href="https://siliconangle.com/2026/03/25/granola-raises-125m-1-5b-valuation-ai-note-taking-app/">Granola raises $125M at $1.5B valuation for its AI note-taking app</a></strong></p><p>Granola raises $125M Series C at a $1.5B valuation led by Index Ventures, following a quarter of 250% revenue growth, with plans to expand its AI meeting notes app toward agentic task automation.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://developer.nvidia.com/blog/deploying-disaggregated-llm-inference-workloads-on-kubernetes/">Deploying Disaggregated LLM Inference Workloads on Kubernetes</a></strong></p><p>A technical guide to deploying disaggregated LLM inference (separate prefill, decode, and router services) on Kubernetes using NVIDIA Grove, KAI Scheduler, and NVIDIA Dynamo.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://zilliz.com/blog/choose-embedding-model-rag-2026">Best Embedding Model for RAG 2026: 10 Models Compared</a></strong></p><p>A practical benchmarking guide comparing 10 embedding models across four production-critical RAG dimensions &#8212; cross-modal, cross-lingual, long-document retrieval, and MRL compression.</p><p><strong><a href="https://www.philschmid.de/kimi-composer-context">How Kimi, Cursor, and Chroma Train Agentic Models with RL</a></strong></p><p>A technical synthesis of three recent agentic RL training reports &#8212; Kimi K2.5, Cursor Composer 2, and Chroma Context-1 &#8212; distilling shared patterns around production-environment training, context management, and reward design.</p><p><strong><a href="https://weaviate.io/blog/multimodal-guide">Multimodal Embeddings and RAG: A 
Practical Guide</a></strong></p><p>A practical guide to multimodal embeddings and RAG covering the core theory (contrastive learning, modality gap, MRL), three concrete build patterns (audio, PDF, video), and when multimodal actually outperforms text-only pipelines.</p><p><strong><a href="https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference">Five techniques to reach the efficient frontier of LLM inference</a></strong></p><p>A practical guide to LLM inference optimization framed around the &#8220;efficient frontier&#8221; concept &#8212; five techniques that move production systems toward the latency/throughput Pareto boundary without additional hardware spend.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/katanemo/plano">katanemo/plano</a></strong></p><p>Plano is an AI-native proxy and data plane for agentic apps &#8212; with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agent&#8217;s core logic.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.29620">Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</a></strong></p><p><strong>Abstract:</strong></p><p>Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. 
Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.</p><p><strong><a href="https://arxiv.org/abs/2603.12201">IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse</a></strong></p><p><strong>Abstract:</strong></p><p>Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L2) to O(Lk). 
However, the indexer itself retains O(L2) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer&#8217;s top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82&#215; prefill speedup and 1.48&#215; decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 448]]></title><description><![CDATA[Cursor's Composer 2, TurboQuant: Redefining AI efficiency with extreme compression, a paper on Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-448</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-448</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 26 Mar 2026 15:03:09 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://cursor.com/blog/composer-2">Cursor&#8217;s Composer 2</a>, <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant: Redefining AI efficiency with extreme compression</a> and <a href="https://arxiv.org/abs/2602.02007">a paper on Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation</a>.</p><p>You may also enjoy <a href="https://www.anthropic.com/features/81k-interviews">What 81,000 people want from AI \ Anthropic</a>, <a href="https://opensearch.org/blog/evaluating-agentic-search-in-opensearch/">Evaluating agentic search in OpenSearch</a>, <a href="https://arxiv.org/abs/2603.20278">a paper on OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research 
Trajectory Synthesis</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://cursor.com/blog/composer-2">Introducing Composer 2 &#183; Cursor</a></strong></p><p>Cursor launches Composer 2, a frontier-level coding model trained via continued pretraining and long-horizon RL that scores 61.3 on CursorBench and 73.7 on SWE-bench Multilingual.</p><p><strong><a href="https://www.anthropic.com/features/81k-interviews">What 81,000 people want from AI \ Anthropic</a></strong></p><p>Anthropic&#8217;s largest-ever qualitative study &#8212; 80,508 Claude users across 159 countries and 70 languages &#8212; reveals what people want from AI, what they&#8217;ve already gotten, and what they fear.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/ai/lyria-3-pro/">Lyria 3 Pro: Create longer tracks in more Google products</a></strong></p><p>Google launches Lyria 3 Pro, an upgraded music generation model that produces tracks up to 3 minutes with structural song awareness (intros, verses, choruses, bridges).</p><p><strong><a href="https://allenai.org/blog/molmoweb">MolmoWeb: An open agent for automating web tasks</a></strong></p><p>Allen AI releases MolmoWeb, a fully open visual web agent built on Molmo 2 that scores 78.2% on WebVoyager and 73.7% on SWE-bench Multilingual, outperforming GPT-4o-based agents while releasing all weights, training data, and evaluation tools.</p><p><strong><a href="https://openai.com/index/openai-to-acquire-astral/">OpenAI to acquire Astral</a></strong></p><p>OpenAI acquires Astral &#8212; maker of Python developer tools uv, Ruff, and ty used by millions of developers &#8212; to deepen its Codex ecosystem.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a 
href="https://medium.com/pinterest-engineering/building-an-mcp-ecosystem-at-pinterest-d881eb4c16f1">Building an MCP Ecosystem at Pinterest</a></strong></p><p>Pinterest Engineering details how they scaled MCP from concept to a production ecosystem of domain-specific servers &#8212; Presto, Spark, Knowledge &#8212; with a central registry, two-layer auth, and 66,000 monthly invocations saving an estimated 7,000 engineer-hours per month.</p><p><strong><a href="https://cursor.com/blog/self-hosted-cloud-agents">Run cloud agents in your own infrastructure</a></strong></p><p>Cursor launches self-hosted cloud agents GA, keeping code and tool execution entirely within enterprise infrastructure while Cursor handles orchestration and inference.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant: Redefining AI efficiency with extreme compression</a></strong></p><p>Google Research releases TurboQuant, a KV cache quantization method that achieves 6x+ memory reduction to 3 bits with zero accuracy loss and 8x attention speedup on H100s.</p><p><strong><a href="https://opensearch.org/blog/evaluating-agentic-search-in-opensearch/">Evaluating agentic search in OpenSearch</a></strong></p><p>A technical deep-dive on how OpenSearch benchmarked its agentic search feature across search relevance (BEIR and BRIGHT datasets) and query execution accuracy (Spider dataset), powered by Claude Opus 4.6.</p><p><strong><a href="https://blog.skypilot.co/scaling-autoresearch/">Scaling Karpathy&#8217;s Autoresearch: What Happens When the Agent Gets a GPU Cluster</a></strong></p><p>A technical blog post on how SkyPilot scaled Karpathy&#8217;s autoresearch agent from 1 to 16 GPUs, enabling ~910 experiments in 8 hours.</p><p><strong><a href="https://blog.bytebytego.com/p/how-anthropics-claude-thinks">How Anthropic&#8217;s Claude Thinks - ByteByteGo Newsletter</a></strong></p><p>ByteByteGo breaks 
down Anthropic&#8217;s interpretability research into six concrete findings about how Claude actually thinks &#8212; from parallel math strategies to ahead-of-time poetry planning to a default-refusal circuit that misfires into hallucinations.</p><p><strong><a href="https://magazine.sebastianraschka.com/p/visual-attention-variants">A Visual Guide to Attention Variants in Modern LLMs</a></strong></p><p>A visual reference guide mapping seven attention variants &#8212; MHA, GQA, MLA, SWA, DeepSeek Sparse Attention, Gated Attention, and hybrid architectures &#8212; across the open-weight models currently using them in production.</p><p><strong><a href="https://cursor.com/blog/fast-regex-search">Fast regex search: indexing text for agent tools</a></strong></p><p>A technical deep-dive on how Cursor built a local sparse n-gram index to replace ripgrep for agent search &#8212; eliminating 15+ second grep latency in large monorepos by narrowing regex matches to a pre-filtered candidate set before full scanning.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/openai/teen-safety-policy-pack?tab=readme-ov-file">openai/teen-safety-policy-pack</a></strong></p><p>A set of prompt-based safety policies designed to create age-appropriate protections for teens.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.20278">OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis</a></strong></p><p><strong>Abstract:</strong></p><p>Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. 
However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy.</p><p><strong><a href="https://arxiv.org/abs/2602.02007">Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation</a></strong></p><p><strong>Abstract:</strong></p><p>Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-k similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. 
We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity&#8211;semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader&#8217;s uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 447]]></title><description><![CDATA[Mamba-3, Agent-native Architectures: How to Build Apps After Code Ends, a paper on Attention Residuals, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-447</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-447</guid><dc:creator><![CDATA[Deep Learning Weekly]]></dc:creator><pubDate>Thu, 19 Mar 2026 15:31:00 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.together.ai/blog/mamba-3">Mamba-3</a>, <a href="https://every.to/guides/agent-native">Agent-native Architectures: How to Build Apps After Code Ends</a> and <a href="https://arxiv.org/abs/2603.15031">a paper on Attention Residuals</a>.</p><p>You may also enjoy <a href="https://mistral.ai/news/mistral-small-4">Introducing Mistral Small 4</a>, <a href="https://aweers.de/blog/2026/rl-for-llms/">State of RL for reasoning LLMs</a>, <a href="https://arxiv.org/abs/2602.04261">a paper on Data Agents: Levels, State of the Art, and Open Problems</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.together.ai/blog/mamba-3">Mamba-3</a></strong></p><p>Together AI releases Mamba-3, an inference-first state space model that outperforms Mamba-2, Gated DeltaNet, and Transformer-based Llama-3.2-1B on end-to-end latency at the 1.5B scale.</p><p><strong><a href="https://mistral.ai/news/mistral-small-4">Introducing Mistral Small 4</a></strong></p><p>Mistral releases Small 4 &#8212; an open-source, 119B-parameter MoE model unifying reasoning, multimodal, and coding capabilities, delivering 40% lower latency and 3x higher throughput than its predecessor.</p><p><strong><a href="https://claude.com/blog/claude-builds-visuals">Claude builds interactive visuals right in your conversation</a></strong></p><p>Anthropic launches inline interactive charts, diagrams, and visualizations in Claude chat &#8212; available in beta across all plan types.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/measuring-agi-cognitive-framework/">Measuring Progress Towards AGI: A Cognitive Framework</a></strong></p><p>Google DeepMind releases a cognitive taxonomy paper proposing 10 human-grounded abilities to measure AGI progress, paired with a $200,000 Kaggle hackathon to crowdsource the missing benchmarks.</p><p><strong><a href="https://siliconangle.com/2026/03/13/gumloop-reels-50m-ai-automation-platform/">Gumloop reels in $50M for its AI automation platform</a></strong></p><p>Gumloop raises $50M Series B led by Benchmark &#8212; with participation from Shopify Ventures and Y Combinator &#8212; bringing total funding to $70M for its no-code, drag-and-drop AI agent automation platform.</p><p><strong><a 
href="https://siliconangle.com/2026/03/16/okta-unveils-new-framework-manage-ai-agents-upcoming-okta-ai-agents-platform/">Okta unveils new framework to manage AI agents and upcoming Okta for AI Agents platform</a></strong></p><p>Okta unveils a security blueprint for the agentic enterprise and announces its &#8220;Okta for AI Agents&#8221; platform &#8212; treating AI agents as governed, non-human identities with centralized access control and a kill switch for rogue agents.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://every.to/guides/agent-native">Agent-native Architectures: How to Build Apps After Code Ends</a></strong></p><p>A technical guide on building agent-native applications &#8212; software architectures where agents are first-class citizens, using atomic tools and outcome-driven loops instead of hardcoded workflows.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://aweers.de/blog/2026/rl-for-llms/">State of RL for reasoning LLMs</a></strong></p><p>A technical deep-dive surveying the evolution of reinforcement learning algorithms for reasoning LLMs (2024&#8211;2026), tracing the lineage from REINFORCE and PPO through GRPO and eight successor methods.</p><p><strong><a href="https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/">Many SWE-bench-Passing PRs Would Not Be Merged into Main - METR</a></strong></p><p>METR researchers found that roughly half of SWE-bench Verified PRs that pass the automated grader would not actually be merged by real repo maintainers, with automated grader scores averaging 24 percentage points higher than maintainer merge rates.</p><p><strong><a href="https://blog.ml.cmu.edu/2026/03/17/lumberchunker-long-form-narrative-document-segmentation/">LumberChunker: Long-Form Narrative Document Segmentation</a></strong></p><p>An article about LumberChunker, a RAG chunking method that uses an LLM to detect semantic boundaries in long-form narrative documents, achieving 
DCG@20 of 62.1% on the GutenQA benchmark &#8212; outperforming all fixed-size and recursive baselines.</p><p><strong><a href="https://ai.stanford.edu/blog/vagen/">VAGEN: Teaching Vision-Language Models to Build World Models Through Reinforcement Learning</a></strong></p><p>A Stanford AI Lab research blog post about VAGEN, a reinforcement learning framework that trains 3B-parameter VLM agents to build internal world models via structured state estimation and transition predictions.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/tobi/qmd">tobi/qmd</a></strong></p><p>A mini CLI search engine for your docs, knowledge bases, meeting notes, and more.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.15031">Attention Residuals</a></strong></p><p><strong>Abstract:</strong></p><p>Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer&#8217;s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. 
Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.</p><p>Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.</p><p><strong><a href="https://arxiv.org/abs/2602.04261">Data Agents: Levels, State of the Art, and Open Problems</a></strong></p><p><strong>Abstract:</strong></p><p>Data agents are an emerging paradigm that leverages large language models (LLMs) and tool-using agents to automate data management, preparation, and analysis tasks. However, the term &#8220;data agent&#8221; is currently used inconsistently, conflating simple query responsive assistants with aspirational fully autonomous &#8220;data scientists&#8221;. This ambiguity blurs capability boundaries and accountability, making it difficult for users, system builders, and regulators to reason about what a &#8220;data agent&#8221; can and cannot do.</p><p>In this tutorial, we propose the first hierarchical taxonomy of data agents from Level 0 (L0, no autonomy) to Level 5 (L5, full autonomy). Building on this taxonomy, we will introduce a lifecycle- and level-driven view of data agents. 
We will (1) present the L0-L5 taxonomy and the key evolutionary leaps that separate simple assistants from truly autonomous data agents, (2) review representative L0-L2 systems across data management, preparation, and analysis, (3) highlight emerging Proto-L3 systems that strive to autonomously orchestrate end-to-end data workflows to tackle diverse and comprehensive data-related tasks under supervision, and (4) discuss forward-looking research challenges towards proactive (L4) and generative (L5) data agents. We aim to offer both a practical map of today&#8217;s systems and a research roadmap for the next decade of data-agent development.</p><p><strong><a href="https://arxiv.org/abs/2603.14473">AI Can Learn Scientific Taste</a></strong></p><p><strong>Abstract:</strong></p><p>Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist&#8217;s executive capability, while enhancing an AI&#8217;s scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year tests, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. 
Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 446]]></title><description><![CDATA[Native Observability & Alerts for Your OpenClaw with Opik, Gemini Embedding 2, a paper on LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-446</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-446</guid><dc:creator><![CDATA[Deep Learning Weekly]]></dc:creator><pubDate>Thu, 12 Mar 2026 15:02:50 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/openclaw-observability/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=openclaw-observability/">Native Observability &amp; Alerts for Your OpenClaw with Opik</a>, <a 
href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/">Gemini Embedding 2</a>, and <a href="https://arxiv.org/abs/2603.03269">a paper on LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory</a>.</p><p>You may also enjoy <a href="https://openai.com/index/introducing-gpt-5-4/">GPT-5.4</a>, <a href="https://arxiv.org/abs/2601.18137">a paper on DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/">Gemini Embedding 2: Our first natively multimodal embedding model</a></strong></p><p>Google launches Gemini Embedding 2, its first natively multimodal embedding model unifying text, images, video, audio, and documents into a single semantic space.</p><p><strong><a href="https://openai.com/index/openai-to-acquire-promptfoo/">OpenAI to acquire Promptfoo</a></strong></p><p>OpenAI is acquiring Promptfoo &#8212; an AI security platform used by 25%+ of Fortune 500 companies &#8212; to embed red-teaming, jailbreak detection, and agentic risk evaluation natively into its enterprise Frontier platform.</p><p><strong><a href="https://openai.com/index/introducing-gpt-5-4/">Introducing GPT-5.4 | OpenAI</a></strong></p><p>OpenAI launches GPT-5.4 with a 1M-token context, new Tool Search API, and record scores on coding and knowledge-work benchmarks &#8212; its most capable frontier model for professional and agentic use.</p><p><strong><a href="https://venturebeat.com/orchestration/google-upgrades-gemini-for-workspace-allowing-it-to-pull-data-from-multiple">Google upgrades Gemini for Workspace allowing it to pull data 
from multiple apps to create Docs, Sheets, Slides and more</a></strong></p><p>Google lets Gemini generate fully-formed Docs, Sheets, and Slides by pulling from Gmail, Drive, and Chat &#8212; turning Workspace into a single-prompt content creation engine.</p><p><strong><a href="https://techcrunch.com/2026/03/09/yann-lecuns-ami-labs-raises-1-03-billion-to-build-world-models/">Yann LeCun&#8217;s AMI Labs raises $1.03B to build world models | TechCrunch</a></strong></p><p>Yann LeCun&#8217;s AMI Labs raises $1.03B at a $3.5B valuation to build JEPA-based world models &#8212; AI that learns from reality rather than language &#8212; with NVIDIA, Samsung, and Eric Schmidt among backers.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/openclaw-observability/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=openclaw-observability/">Native Observability &amp; Alerts for Your OpenClaw with Opik</a></strong></p><p>A blog post announcing opik-openclaw, a native OpenClaw plugin from Comet that adds full-stack observability &#8212; tracing every LLM call, tool execution, token cost, and sub-agent delegation &#8212; to address the visibility gap in autonomous agent workflows.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://openai.com/index/instruction-hierarchy-challenge/">Improving instruction hierarchy in frontier LLMs</a></strong></p><p>A technical research post about OpenAI&#8217;s IH-Challenge &#8212; an RL training dataset that teaches models a strict trust hierarchy (System &gt; Developer &gt; User &gt; Tool) to resist prompt injection, jailbreaks, and instruction conflicts.</p><p><strong><a href="https://huggingface.co/blog/nvidia/synthetic-code-concepts">Code Concepts: A Large-Scale Synthetic Dataset Generated from Programming Concept Seeds</a></strong></p><p>A technical blog post about NVIDIA&#8217;s concept-driven synthetic data pipeline that generated 15M Python programming 
problems, yielding a 6-point HumanEval gain (73&#8594;79) when included in Nemotron-Nano-v3 pretraining.</p><p><strong><a href="https://www.philschmid.de/testing-skills#1-create-a-prompt-set">Practical Guide to Evaluating and Testing Agent Skills</a></strong></p><p>A practical guide about building lightweight eval harnesses for agent skills, walking through how to define success criteria, construct prompt sets, and iterate &#8212; illustrated by taking a Gemini Interactions API skill from 66.7% to 100% pass rate.</p><p><strong><a href="https://ai.stanford.edu/blog/tos_stanford_blog/">Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?</a></strong></p><p>A Stanford benchmark revealing that frontier models (GPT-5.2, Gemini-3 Pro, Claude 4.5 Sonnet) all fail to build accurate, revisable cognitive maps during active spatial exploration &#8212; humans consistently outperform all of them.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/alibaba/page-agent">alibaba/page-agent</a></strong></p><p>JavaScript in-page GUI agent. Control web interfaces with natural language.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.03269">LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory</a></strong></p><p><strong>Abstract:</strong></p><p>Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. 
We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.</p><p><strong><a href="https://arxiv.org/abs/2601.18137">DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints</a></strong></p><p><strong>Abstract:</strong></p><p>While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. 
It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 445]]></title><description><![CDATA[Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems, Nano Banana 2: Combining Pro capabilities with lightning-fast speed, a paper on Beyond Language Modeling: An]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-445</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-445</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 05 Mar 2026 16:03:02 GMT</pubDate><enclosure
url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/opik-claude-code-plugin/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=opik-claude-code-plugin/">Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems</a>, <a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/">Nano Banana 2: Combining Pro capabilities with lightning-fast speed</a> and <a href="https://arxiv.org/abs/2603.03276">a paper on Beyond Language Modeling: An Exploration of Multimodal Pretraining</a>.</p><p>You may also enjoy <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/">Gemini 3.1 Flash-Lite</a>, <a href="https://news.mit.edu/2026/personalization-features-can-make-llms-more-agreeable-0218">Personalization features can make LLMs more agreeable</a>, <a href="https://arxiv.org/abs/2602.22661">a paper on dLLM: Simple Diffusion Language Modeling</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/">Gemini 3.1 Flash-Lite: Built for intelligence at scale</a></strong></p><p>Google launches Gemini 3.1 Flash-Lite in preview, positioning it as their fastest and most cost-efficient model yet at $0.25/1M input tokens &#8212; built specifically for high-volume developer workloads demanding both speed and reasoning.</p><p><strong><a href="https://openai.com/index/gpt-5-3-instant/">GPT-5.3 Instant: Smoother, more useful everyday conversations</a></strong></p><p>OpenAI releases GPT-5.3 Instant as the new default ChatGPT model, cutting hallucinations by up to 26.8% and dramatically reducing the over-cautious, &#8220;cringe&#8221; responses that frustrated everyday users.</p><p><strong><a href="https://www.anthropic.com/news/statement-department-of-war">Statement from Dario Amodei on our discussions with the Department of War</a></strong></p><p>Anthropic&#8217;s Dario Amodei publicly refuses Department of War demands to remove AI safeguards on mass domestic surveillance and fully autonomous weapons.</p><p><strong><a href="https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in">Did Alibaba just kneecap its powerful Qwen AI team? 
Key figures depart in wake of latest open source release</a></strong></p><p>Alibaba&#8217;s Qwen AI team loses its founding technical lead and two key researchers just 24 hours after shipping the Qwen3.5 small model series, raising alarm about the project&#8217;s open-source future and triggering a 5% drop in Alibaba&#8217;s stock.</p><p><strong><a href="https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/">Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model</a></strong></p><p>Microsoft releases Phi-4-reasoning-vision-15B, a compact open-weight multimodal model that rivals much larger models on math, science, and computer-use tasks while requiring a fraction of the training compute.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/">Nano Banana 2: Combining Pro capabilities with lightning-fast speed</a></strong></p><p>Google launches Nano Banana 2 (Gemini 3.1 Flash Image), combining the advanced quality of Nano Banana Pro with Flash-level speed, rolling out across Gemini, Search, Google Ads, Vertex AI, and Flow.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/opik-claude-code-plugin/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=opik-claude-code-plugin/">Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems</a></strong></p><p>Announcing the new Opik Claude Code Plugin, which automatically instruments Python and JavaScript agent code with tracing, applies observability best practices, and logs what Claude Code is doing as it modifies a system.</p><p><strong><a href="https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud">Improve chatbot memory using Google Cloud</a></strong></p><p>A practical guide about building scalable long-term memory for agentic chatbots using a 
three-tier polyglot storage architecture on Google Cloud (Redis, Bigtable, BigQuery).</p><h2><strong>Learning</strong></h2><p><strong><a href="https://news.mit.edu/2026/personalization-features-can-make-llms-more-agreeable-0218">Personalization features can make LLMs more agreeable</a></strong></p><p>MIT/Penn State research finds LLM personalization features significantly amplify sycophantic behavior, with memory-stored user profiles having the greatest effect across 4 of 5 models tested in real two-week user interactions.</p><p><strong><a href="https://montrealethics.ai/tech-futures-the-threat-to-digital-infrastructure/">The threat of AI-generated code to the world&#8217;s digital infrastructure</a></strong></p><p>An article about how AI-enabled &#8220;vibe contributing&#8221; &#8212; low-quality, AI-generated code submitted by novice contributors &#8212; is overwhelming volunteer open source maintainers and threatening the stability of global digital infrastructure.</p><p><strong><a href="https://research.google/blog/teaching-llms-to-reason-like-bayesians/">Teaching LLMs to reason like Bayesians</a></strong></p><p>A research blog post about how Google trained LLMs to reason like optimal Bayesian agents via fine-tuning on Bayesian model outputs, dramatically improving probabilistic belief-updating across domains.</p><p><strong><a href="https://huggingface.co/blog/moe-transformers">Mixture of Experts (MoEs) in Transformers</a></strong></p><p>A technical blog post about how Hugging Face redesigned the transformers library to make Mixture-of-Experts (MoE) models first-class citizens, covering weight loading, expert routing backends, parallelism, and training optimizations.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated 
evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/pydantic/monty">pydantic/monty</a></strong></p><p>A minimal, secure Python interpreter written in Rust for use by AI.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.22661">dLLM: Simple Diffusion Language Modeling</a></strong></p><p><strong>Abstract:</strong></p><p>Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures.</p><p>To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.</p><p><strong><a href="https://arxiv.org/abs/2603.03276">Beyond Language Modeling: An Exploration of Multimodal Pretraining</a></strong></p><p><strong>Abstract:</strong></p><p>The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque.
We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 444]]></title><description><![CDATA[Gemini 3.1 Pro, A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026, a paper on Does Your Reasoning Model Implicitly Know When to Stop Thinking?, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-444</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-444</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 26 Feb 2026 16:01:52 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/">Gemini 3.1 Pro</a>, <a href="https://magazine.sebastianraschka.com/p/a-dream-of-spring-for-open-weight">A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026</a> and <a href="https://arxiv.org/abs/2602.08354">a paper on Does Your Reasoning Model Implicitly Know When to Stop Thinking?</a>.</p><p>You may also enjoy <a href="https://venturebeat.com/orchestration/anthropic-just-released-a-mobile-version-of-claude-code-called-remote">Anthropic&#8217;s Remote Control</a>, <a href="https://posthog.com/blog/optimizing-agent-cost">How we caught our AI agent embezzling tokens</a>, <a href="https://arxiv.org/abs/2602.21193">a paper on On Data
Engineering for Scaling LLM Terminal Capabilities</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/">Gemini 3.1 Pro: A smarter model for your most complex tasks</a></strong></p><p>Google launches Gemini 3.1 Pro, claiming more than double the reasoning performance of its predecessor on complex logic benchmarks, now rolling out across developer, enterprise, and consumer products.</p><p><strong><a href="https://venturebeat.com/orchestration/anthropic-just-released-a-mobile-version-of-claude-code-called-remote">Anthropic just released a mobile version of Claude Code called Remote Control</a></strong></p><p>Anthropic launches Claude Code Remote Control, a new feature enabling developers to initiate coding sessions on their local terminal and seamlessly continue them from any mobile device or browser without moving code to the cloud.</p><p><strong><a href="https://cohere.com/blog/cohere-labs-tiny-aya">Cohere Labs Launches Tiny Aya, Making Multilingual AI Accessible</a></strong></p><p>Cohere Labs releases Tiny Aya, a 3.35B open-weight model claiming top multilingual performance in its size class across region-specific language variants.</p><p><strong><a href="https://cursor.com/blog/agent-computer-use">Cursor agents can now control their own computers</a></strong></p><p>Cursor launches cloud agents that run in isolated VMs with full computer-use capabilities, producing merge-ready PRs with video/screenshot artifacts to validate their work across web, mobile, Slack, and GitHub.</p><p><strong><a href="https://venturebeat.com/orchestration/visual-imitation-learning-guidde-trains-ai-agents-on-human-expert-video">Visual imitation learning: 
Guidde trains AI agents on human &#8216;expert video&#8217; instead of documentation</a></strong></p><p>Guidde raises $50M to train AI agents on expert screen-recording videos instead of static documentation, cutting video creation time by 41% and support tickets by 34%.</p><p><strong><a href="https://techcrunch.com/2026/02/25/the-public-opposition-to-ai-infrastructure-is-heating-up/">The public opposition to AI infrastructure is heating up</a></strong></p><p>Bipartisan opposition to AI data centers is escalating across the U.S., with states like New York proposing three-year construction moratoriums and communities pulling tax incentives, even as Big Tech commits $650B in infrastructure spending.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://posthog.com/blog/optimizing-agent-cost">How we caught our AI agent embezzling tokens</a></strong></p><p>A PostHog engineering deep-dive into how they traced, diagnosed, and reduced their AI Wizard agent&#8217;s $6.67/run inference cost &#8212; uncovering three &#8220;token embezzlement&#8221; patterns and counterintuitive findings about context management and caching.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://magazine.sebastianraschka.com/p/a-dream-of-spring-for-open-weight">A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026</a></strong></p><p>A comprehensive architectural deep-dive comparing 10 major open-weight LLM releases from January&#8211;February 2026, highlighting the convergence toward hybrid attention mechanisms and efficiency-first design across models ranging from 3B to 1T parameters.</p><p><strong><a href="https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d">MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix</a></strong></p><p>An engineering blog post about how Netflix built MediaFM, its first in-house tri-modal (audio, video, text) foundation model trained on 
tens of millions of catalog shots to power recommendations, ad relevancy, and promotional asset optimization at scale.</p><p><strong><a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">Detecting and preventing distillation attacks \ Anthropic</a></strong></p><p>Anthropic exposes three Chinese AI labs &#8212; DeepSeek, Moonshot, and MiniMax &#8212; for running industrial-scale &#8220;distillation attacks&#8221; that illicitly extracted Claude&#8217;s capabilities across 16M+ exchanges through ~24,000 fraudulent accounts.</p><p><strong><a href="https://epoch.ai/blog/expanding-our-analysis-of-biological-ai-models">Expanding our analysis of biological AI models | Epoch AI</a></strong></p><p>A comprehensive Epoch AI report cataloging 1,196 biological AI models across nine categories, revealing critical biosafety gaps and landscape trends commissioned by Sentinel Bio.</p><p><strong><a href="https://research.google/blog/teaching-ai-to-read-a-map/">Teaching AI to read a map</a></strong></p><p>Google Research introduces MapTrace, a fully automated synthetic data pipeline using Gemini and Imagen models to generate 2M annotated map path examples &#8212; teaching multimodal LLMs fine-grained spatial reasoning and reducing path-tracing error by 33% on real-world benchmarks.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/vxcontrol/pentagi">vxcontrol/pentagi</a></strong></p><p>Fully autonomous AI Agents system capable of performing complex penetration testing tasks</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.08354">Does Your Reasoning Model Implicitly Know 
When to Stop Thinking?</a></strong></p><p><strong>Abstract:</strong></p><p>Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.</p><p><strong><a href="https://arxiv.org/abs/2602.21193">On Data Engineering for Scaling LLM Terminal Capabilities</a></strong></p><p><strong>Abstract:</strong></p><p>Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. 
We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 443]]></title><description><![CDATA[Optimizing AI IDEs at Scale, What do &#8220;economic value&#8221; benchmarks tell us, a paper on MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-443</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-443</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 19 Feb 2026 17:03:10 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/optimize-ai-ide-cost/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=optimize-ai-ide-cost/">Optimizing AI IDEs at Scale</a>, <a href="https://epoch.ai/blog/what-do-economic-value-benchmarks-tell-us">What do &#8220;economic value&#8221; benchmarks tell us?</a> and <a href="https://arxiv.org/abs/2602.02474">a paper on MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents</a>.</p><p>You may also enjoy <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/">Gemini 3 Deep Think: Advancing science, research and engineering</a>, <a href="https://huggingface.co/blog/openenv-turing">OpenEnv in Practice: Evaluating
Tool-Using Agents in Real-World Environments</a>, <a href="https://openreview.net/forum?id=tq9lyV9Cml">a paper on Thought Communication in Multiagent Collaboration</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.anthropic.com/news/claude-sonnet-4-6">Introducing Claude Sonnet 4.6</a></strong></p><p>Anthropic launches Claude Sonnet 4.6 as the new default model across all plans, featuring a 1M token context window, major computer use improvements, and Opus-level performance on many tasks at the same $3/$15 per million token price as Sonnet 4.5.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/">Gemini 3 Deep Think: Advancing science, research and engineering</a></strong></p><p>Google announces a major upgrade to Gemini 3 Deep Think, its specialized reasoning mode targeting frontier science, math, and engineering &#8212; setting new benchmark records and opening early API access to researchers and enterprises.</p><p><strong><a href="https://www.cnbc.com/2026/02/17/china-alibaba-qwen-ai-agent-latest-model.html">Alibaba unveils Qwen3.5 as China&#8217;s chatbot race shifts to AI agents</a></strong></p><p>Alibaba launches Qwen 3.5 &#8212; a 397B-parameter, natively multimodal open-weight model built for agentic AI &#8212; as China&#8217;s frontier model race intensifies ahead of an expected DeepSeek release.</p><p><strong><a href="https://siliconangle.com/2026/02/17/ai-agent-reliability-startup-temporal-raises-300m-funding/">AI agent reliability startup Temporal raises $300M in funding</a></strong></p><p>Temporal raises $300M Series D at a $5B valuation, led by a16z, to scale its open-source platform that makes AI agents fault-tolerant by logging 
every action and enabling automatic recovery from failures.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/optimize-ai-ide-cost/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=optimize-ai-ide-cost/">Optimizing AI IDEs at Scale</a></strong></p><p>A blog post detailing how Comet&#8217;s engineering team traced rising AI IDE spend to bloated context windows and always-on agent rules, then reduced token overhead by shrinking default context, modularizing skills, and tightening evaluation loops.</p><p><strong><a href="https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194">Scaling LLM Post-Training at Netflix</a></strong></p><p>A technical blog post about how Netflix built an internal LLM post-training framework using Ray-based distributed orchestration to scale fine-tuning and RL workflows across multi-node GPU clusters for recommendation, search, and personalization.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://huggingface.co/blog/openenv-turing">OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments</a></strong></p><p>A technical blog post about OpenEnv, an open-source agent evaluation framework, and findings from testing tool-using agents in a production-grade calendar benchmark &#8212; revealing that ambiguity and multi-step chaining, not tool selection, are the primary failure modes.</p><p><strong><a href="https://www.seangoedecke.com/fast-llm-inference/">Two different tricks for fast LLM inference</a></strong></p><p>A technical blog post comparing Anthropic&#8217;s and OpenAI&#8217;s &#8220;fast mode&#8221; inference approaches &#8212; low-batch-size serving vs. 
Cerebras wafer-scale chips &#8212; and arguing that accuracy, not raw speed, remains the dominant factor in agentic AI value.</p><p><strong><a href="https://milvus.io/blog/we-extracted-openclaws-memory-system-and-opensourced-it-memsearch.md">We Extracted OpenClaw&#8217;s Memory System and Open-Sourced It (memsearch)</a></strong></p><p>A technical blog post about how Zilliz extracted OpenClaw&#8217;s transparent, Markdown-based long-term memory architecture and open-sourced it as memsearch &#8212; a standalone, framework-agnostic memory library backed by Milvus vector search.</p><p><strong><a href="https://epoch.ai/blog/what-do-economic-value-benchmarks-tell-us">What do &#8220;economic value&#8221; benchmarks tell us? | Epoch AI</a></strong></p><p>A research report analyzing three &#8220;economic value&#8221; benchmarks that measure AI performance on real-world digital work tasks, concluding that high scores signal meaningful task-level acceleration but fall short of implying end-to-end job automation.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/vercel-labs/json-render">vercel-labs/json-render</a></strong></p><p>json-render is a Generative UI framework: AI generates interfaces from natural language prompts, constrained to components you define.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.02474">MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents</a></strong></p><p><strong>Abstract:</strong></p><p>Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. 
These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present <strong>MemSkill</strong>, which reframes these operations as learnable and evolvable memory skills: structured, reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a <em>controller</em> that learns to select a small set of relevant skills, paired with an LLM-based <em>executor</em> that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a <em>designer</em> that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.</p><p><strong><a href="https://openreview.net/forum?id=tq9lyV9Cml">Thought Communication in Multiagent Collaboration</a></strong></p><p><strong>Abstract:</strong></p><p>Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy.
To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 442]]></title><description><![CDATA[Claude Opus 4.6, Harness engineering: leveraging Codex in an agent-first world, a paper on Weak-Driven Learning: How Weak Agents make Strong Agents Stronger, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-442</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-442</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 12 Feb 2026 16:02:28 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.anthropic.com/news/claude-opus-4-6">Claude Opus 4.6</a>, <a href="https://openai.com/index/harness-engineering/">Harness engineering: leveraging Codex in an agent-first world</a> and <a href="https://arxiv.org/abs/2602.08222">a paper on Weak-Driven Learning: How Weak Agents make Strong Agents Stronger</a>.</p><p>You may also enjoy <a href="https://openai.com/index/introducing-gpt-5-3-codex/">GPT-5.3-Codex</a>, <a href="https://research.google/blog/beyond-one-on-one-authoring-simulating-and-testing-dynamic-human-ai-group-conversations/">Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations</a>, <a href="https://arxiv.org/abs/2602.08234">a paper on SkillRL: Evolving Agents via Recursive Skill-Augmented 
Reinforcement Learning</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-6">Introducing Claude Opus 4.6</a></strong></p><p>Anthropic launches Claude Opus 4.6 with state-of-the-art agentic coding, a 1M-token context window, and industry-leading scores on Terminal-Bench 2.0, Humanity&#8217;s Last Exam, and GDPval-AA.</p><p><strong><a href="https://openai.com/index/introducing-gpt-5-3-codex/">Introducing GPT-5.3-Codex | OpenAI</a></strong></p><p>OpenAI launches GPT-5.3-Codex &#8212; its first self-bootstrapped model that helped debug its own training &#8212; combining GPT-5.2&#8217;s reasoning with frontier coding performance at 25% faster speeds.</p><p><strong><a href="https://siliconangle.com/2026/02/10/world-model-startup-runway-closes-315m-funding-round/">World model startup Runway closes $315M funding round</a></strong></p><p>Runway closes a $315M Series E led by General Atlantic at a $5.3B valuation, with backing from NVIDIA and AMD, to advance its world models for 3D environment generation used in robotics simulation and video production.</p><p><strong><a href="https://venturebeat.com/orchestration/openai-upgrades-its-responses-api-to-support-agent-skills-and-a-complete">OpenAI upgrades its Responses API to support agent skills and a complete terminal shell</a></strong></p><p>An article about OpenAI adding server-side compaction, hosted shell containers, and the open &#8220;Skills&#8221; standard to its Responses API, enabling agents to handle 5M+ token sessions without context degradation.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.pinecone.io/blog/millions-at-stake-melange/#The-Infrastructure-Problem-Behind-the-Accuracy-Problem">Millions at 
Stake: How Melange&#8217;s High-Recall Retrieval Prevents Litigation Collapse</a></strong></p><p>A case study about how patent analytics company Melange uses Pinecone&#8217;s vector database to achieve 99% recall across 600M+ documents, saving $75K annually while preventing million-dollar litigation risks from missed prior art.</p><p><strong><a href="https://openai.com/index/harness-engineering/">Harness engineering: leveraging Codex in an agent-first world</a></strong></p><p>An engineering post on how OpenAI built a million-line codebase with zero hand-written code using a 3-engineer team driving Codex agents at 3.5 PRs/engineer/day, redefining the developer role as harness design over direct coding.</p><p><strong><a href="https://venturebeat.com/data/observational-memory-cuts-ai-agent-costs-10x-and-outscores-rag-on-long">&#8216;Observational memory&#8217; cuts AI agent costs 10x and outscores RAG on long-context benchmarks</a></strong></p><p>An article about Mastra&#8217;s open-source &#8220;observational memory&#8221; architecture that uses Observer and Reflector agents to compress conversation history into stable, cacheable context &#8212; scoring 94.87% on LongMemEval while cutting token costs 10x versus traditional RAG.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/beyond-one-on-one-authoring-simulating-and-testing-dynamic-human-ai-group-conversations/">Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations</a></strong></p><p>A blog post on DialogLab, Google&#8217;s open-source framework for designing and testing multi-party human-AI group conversations with configurable roles, turn-taking rules, and a human-in-the-loop control mode.</p><p><strong><a href="https://milvus.io/blog/openclaw-formerly-clawdbot-moltbot-explained-a-complete-guide-to-the-autonomous-ai-agent.md">What Is OpenClaw? 
Complete Guide to the Open-Source AI Agent</a></strong></p><p>A guide to OpenClaw, the open-source, self-hosted AI agent that surpassed 175K GitHub stars in under two weeks by enabling autonomous task execution through messaging apps like WhatsApp, Telegram, and Slack.</p><p><strong><a href="https://www.anthropic.com/research/AI-assistance-coding-skills">How AI assistance impacts the formation of coding skills \ Anthropic</a></strong></p><p>A randomized controlled trial showing AI coding assistance decreased skill mastery by 17% among 52 software engineers, with debugging abilities most affected despite minimal productivity gains.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/google/A2UI">google/A2UI</a></strong></p><p>A2UI is an open-source project that allows agents to generate or populate rich user interfaces.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.08222">Weak-Driven Learning: How Weak Agents make Strong Agents Stronger</a></strong></p><p><strong>Abstract:</strong></p><p>As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models&#8217; own historical weak states. Motivated by this observation, we propose WMSS (Weak Agents Can Make Strong Agents Stronger), a post-training paradigm that leverages weak checkpoints to guide continued optimization. 
By identifying recoverable learning gaps via entropy dynamics and reinforcing them through compensatory learning, WMSS enables strong agents to improve beyond conventional post-training saturation. Experiments on mathematical reasoning and code generation datasets show that agents trained with our approach achieve effective performance improvements, while incurring zero additional inference cost.</p><p><strong><a href="https://arxiv.org/abs/2602.08234">SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning</a></strong></p><p><strong>Abstract:</strong></p><p>Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting high-level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library SkillBank, an adaptive retrieval strategy for general and task-specific heuristics, and a recursive evolution mechanism that allows the skill library to co-evolve with the agent&#8217;s policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. 
Experimental results on ALFWorld, WebShop, and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance, outperforming strong baselines by over 15.3% and maintaining robustness as task complexity increases.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 441]]></title><description><![CDATA[Qwen3-Coder-Next, Inside OpenAI&#8217;s in-house data agent, a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-441</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-441</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 05 Feb 2026 16:01:54 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://qwen.ai/blog?id=qwen3-coder-next">Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding</a>, <a
href="https://openai.com/index/inside-our-in-house-data-agent/">Inside OpenAI&#8217;s in-house data agent</a> and <a href="https://arxiv.org/abs/2601.22975">a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text</a>.</p><p>You may also enjoy <a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/">Project Genie: Experimenting with infinite, interactive worlds</a>, <a href="https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/">Towards a science of scaling agent systems: When and why agent systems work</a>, <a href="https://arxiv.org/abs/2601.23265">a paper on PaperBanana: Automating Academic Illustration for AI Scientists</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://qwen.ai/blog?id=qwen3-coder-next">Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding</a></strong></p><p>Official blog post announcing Qwen3-Coder-Next, an 80B-parameter coding model achieving competitive performance on SWE-Bench (70.6% on Verified) while enabling 10x higher throughput for repository-level agentic workflows.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/">Project Genie: Experimenting with infinite, interactive worlds</a></strong></p><p>Google launches Project Genie, an experimental world model powered by Genie 3 that lets Google AI Ultra subscribers create and explore infinite, interactive environments in real-time using text and image prompts.</p><p><strong><a href="https://venturebeat.com/infrastructure/vercel-rebuilt-v0-to-tackle-the-90-problem-connecting-ai-generated-code-to">Vercel rebuilt v0 to tackle the 
90% problem: Connecting AI-generated code to existing production infrastructure, not prototypes</a></strong></p><p>A news article reporting Vercel&#8217;s complete rebuild of v0 to address the &#8220;90% problem&#8221; where AI-generated code fails to integrate with existing production infrastructure.</p><p><strong><a href="https://mistral.ai/news/voxtral-transcribe-2">Voxtral transcribes at the speed of sound. | Mistral AI</a></strong></p><p>A product announcement for Mistral&#8217;s Voxtral Transcribe 2, featuring state-of-the-art speech-to-text with speaker diarization at $0.003/min and Voxtral Realtime with sub-200ms latency for live transcription.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://openai.com/index/inside-our-in-house-data-agent/">Inside OpenAI&#8217;s in-house data agent</a></strong></p><p>OpenAI&#8217;s internal data agent powered by GPT-5.2 enables natural language queries across 600+ petabytes and 70,000 datasets, using multi-layered context and self-correction to deliver trustworthy analytics in minutes.</p><p><strong><a href="https://weaviate.io/blog/limit-in-the-loop">The Limit in the Loop</a></strong></p><p>A blog post arguing AI memory requires active maintenance infrastructure with six core functions to prevent accumulated noise from degrading agent performance over time.</p><p><strong><a href="https://www.philschmid.de/acp-overview">The Agent Client Protocol Overview</a></strong></p><p>A technical overview of the Agent Client Protocol (ACP), an open JSON-RPC 2.0 standard that provides a common interface for editors to interact with AI coding agents.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/">Towards a science of scaling agent systems: When and why agent systems work</a></strong></p><p>A research article presenting Google&#8217;s evaluation of 180 agent configurations, revealing multi-agent 
systems boost parallelizable tasks by 81% but degrade sequential tasks by 70%.</p><p><strong><a href="https://www.astralcodexten.com/p/moltbook-after-the-first-weekend?hide_intro_popup=true">Moltbook: After The First Weekend - by Scott Alexander</a></strong></p><p>Scott Alexander examines whether Moltbook AI activity is &#8220;real&#8221; or &#8220;roleplay&#8221; by evaluating external causes and effects.</p><p><strong><a href="https://alignment.anthropic.com/2026/hot-mess-of-ai/">The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?</a></strong></p><p>A research article from Anthropic finding AI failures increasingly stem from incoherence rather than systematic misalignment as tasks grow harder, suggesting future risks resemble industrial accidents more than coherent goal pursuit.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/jezweb/claude-skills">jezweb/claude-skills</a></strong></p><p>Skills for Claude Code CLI such as full stack dev Cloudflare, React, Tailwind v4, and AI integrations.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2601.22975">Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text</a></strong></p><p><strong>Abstract:</strong></p><p>Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. 
To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.</p><p><strong><a href="https://arxiv.org/abs/2601.23265">PaperBanana: Automating Academic Illustration for AI Scientists</a></strong></p><p><strong>Abstract:</strong></p><p>Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. 
Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.</p><p><strong><a href="https://arxiv.org/abs/2602.01785">CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding</a></strong></p><p><strong>Abstract:</strong></p><p>Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. 
Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 440]]></title><description><![CDATA[Terminally online Mistral Vibe, ATLAS: Practical scaling laws for multilingual models, a paper on GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization, and m]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-440</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-440</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 29 Jan 2026 16:02:09 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://mistral.ai/news/mistral-vibe-2-0">Terminally online Mistral Vibe.</a>, <a href="https://research.google/blog/atlas-practical-scaling-laws-for-multilingual-models/">ATLAS: Practical scaling laws for multilingual models</a> and <a href="https://arxiv.org/abs/2601.05242">a paper on GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</a>.</p><p>You may also enjoy <a href="https://siliconangle.com/2026/01/27/moonshot-ai-releases-open-source-kimi-k2-5-model-1t-parameters/">Moonshot AI releases open-source Kimi K2.5 model with 1T parameters</a>, <a href="https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151">The AI Evolution of Graph Search at 
Netflix From Structured Queries to Natural Language</a>, <a href="https://arxiv.org/abs/2601.08763">a paper on Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://mistral.ai/news/mistral-vibe-2-0">Terminally online Mistral Vibe.</a></strong></p><p>Mistral launches Vibe 2.0, a terminal-native coding agent powered by Devstral 2, featuring custom subagents, multi-choice clarifications, and slash-command skills.</p><p><strong><a href="https://siliconangle.com/2026/01/27/moonshot-ai-releases-open-source-kimi-k2-5-model-1t-parameters/">Moonshot AI releases open-source Kimi K2.5 model with 1T parameters</a></strong></p><p>Moonshot AI releases open-source Kimi K2.5, a 1 trillion parameter mixture-of-experts model trained on 15 trillion tokens that outperforms GPT-5.2 on several benchmarks including the challenging HLE-Full evaluation.</p><p><strong><a href="https://techcrunch.com/2026/01/27/node-based-design-tool-flora-raises-42m-from-redpoint-ventures/">Node-based design tool Flora raises $42M from Redpoint Ventures</a></strong></p><p>Flora, an AI-powered design platform, raises $42M Series A led by Redpoint Ventures to democratize creative workflows through multimodal generative AI and infinite canvas collaboration.</p><p><strong><a href="https://ampcode.com/news/deep-mode">Go Deep - Amp</a></strong></p><p>Amp launches &#8220;deep&#8221; mode powered by GPT-5.2-Codex, a highly autonomous coding agent that silently researches codebases for 5-15 minutes before making changes, complementing their interactive &#8220;smart&#8221; mode for different workflow needs.</p><p><strong><a 
href="https://blogs.nvidia.com/blog/nvidia-earth-2-open-models/">NVIDIA Launches Earth-2 Family of Open Models &#8212; the World&#8217;s First Fully Open, Accelerated Set of Models and Tools for AI Weather</a></strong></p><p>NVIDIA launches Earth-2 family of open weather AI models&#8212;the world&#8217;s first fully open, accelerated weather forecasting stack&#8212;offering models for 15-day global forecasts, local storm prediction, and data assimilation that run up to 500x faster than traditional physics-based methods.</p><p><strong><a href="https://allenai.org/blog/open-coding-agents">Open Coding Agents: Fast, accessible coding agents that adapt to any repo | Ai2</a></strong></p><p>Allen Institute for AI launches Open Coding Agents featuring SERA, an open-source coding agent, enabling repository-specific specialization where 32B models match 100B+ teachers on private codebases.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151">The AI Evolution of Graph Search at Netflix From Structured Queries to Natural Language</a></strong></p><p>A technical blog post detailing Netflix&#8217;s implementation of LLM-powered natural language search for their Graph Search platform, transforming structured GraphQL queries into intuitive text-based interfaces for enterprise data discovery.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/atlas-practical-scaling-laws-for-multilingual-models/">ATLAS: Practical scaling laws for multilingual models</a></strong></p><p>Google Research introduces ATLAS (Adaptive Transfer Scaling Laws), the largest public multilingual pre-training study with 774 training runs across 400+ languages.</p><p><strong><a href="https://www.arcee.ai/blog/trinity-large">Arcee AI | Trinity Large: An Open 400B Sparse MoE Model</a></strong></p><p>A technical deep-dive on Arcee AI&#8217;s Trinity Large, a 400B parameter sparse MoE 
model with 13B active parameters achieving frontier-class performance at 2-3x faster inference than peers, trained in 33 days for $20M total cost.</p><p><strong><a href="https://mitsloan.mit.edu/ideas-made-to-matter/ai-open-models-have-benefits-so-why-arent-they-more-widely-used">AI open models have benefits. So why aren&#8217;t they more widely used?</a></strong></p><p>A research article examining why open AI models, despite achieving 90% of closed-model performance at 87% lower cost, account for only 20% of usage while closed models dominate most of the market.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/aiming-lab/SimpleMem">aiming-lab/SimpleMem</a></strong></p><p>Efficient Lifelong Memory for LLM Agents</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2601.05242">GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</a></strong></p><p><strong>Abstract:</strong></p><p>As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. 
In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.</p><p><strong><a href="https://arxiv.org/abs/2601.08763">Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</a></strong></p><p><strong>Abstract:</strong></p><p>Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. 
Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 439]]></title><description><![CDATA[FLUX.2 [klein], Heaps do lie: debugging a memory leak in vLLM. 
a paper on Toward Efficient Agents: Memory, Tool learning, and Planning, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-439</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-439</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 22 Jan 2026 16:03:07 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence">FLUX.2 [klein]</a>, <a href="https://mistral.ai/news/debugging-memory-leak-in-vllm">Heaps do lie: debugging a memory leak in vLLM.</a> and <a href="https://arxiv.org/abs/2601.14192">a paper on Toward Efficient Agents: Memory, Tool learning, and Planning</a>.</p><p>You may also enjoy <a href="https://blog.google/innovation-and-ai/products/gemini-app/personal-intelligence/">Personal Intelligence: Connecting Gemini to Google apps</a>, <a href="https://huggingface.co/blog/zilliz/zilliz-semantic-highlight-model">How We Built a Semantic Highlight Model To Save Token Cost for RAG</a>, <a href="https://arxiv.org/abs/2601.11514">a paper on ShapeR: Robust Conditional 3D Shape Generation from Casual Captures</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence">FLUX.2 [klein]: Towards Interactive Visual Intelligence | Black Forest Labs</a></strong></p><p>Black Forest Labs launches FLUX.2 [klein], a unified image generation and editing model achieving sub-0.5s inference on consumer GPUs (13GB VRAM) while matching models 5x its size in quality.</p><p><strong><a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes</a></strong></p><p>A virtual hackathon that focuses on shipping LLM-powered apps that turn New Year&#8217;s resolutions into measurable outcomes across six impact categories.</p><p><strong><a href="https://blog.google/innovation-and-ai/products/gemini-app/personal-intelligence/">Personal Intelligence: Connecting Gemini to Google apps</a></strong></p><p>Google launches Personal Intelligence beta for Gemini, connecting Gmail, Photos, YouTube, and Search with one tap to enable contextual, personalized AI assistance.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/ai/veo-3-1-ingredients-to-video/">Veo 3.1 Ingredients to Video: More consistency, creativity and control</a></strong></p><p>Google announces Veo 3.1 &#8220;Ingredients to Video&#8221; update featuring native vertical video generation, improved character consistency, and state-of-the-art upscaling for mobile-first content creation.</p><p><strong><a href="https://siliconangle.com/2026/01/21/preply-raises-150m-enhance-human-led-language-learning-ai/">Preply raises $150M to enhance human-led language learning with AI</a></strong></p><p>Language learning marketplace Preply raises $150M Series D at $1.2B valuation to 
scale AI-enhanced human tutoring.</p><p><strong><a href="https://siliconangle.com/2026/01/15/openai-quietly-launches-chatgpt-translate-support-25-languages/">OpenAI quietly launches ChatGPT Translate with support for 25 languages</a></strong></p><p>OpenAI quietly launches ChatGPT Translate as a free, standalone web prototype supporting 25 languages, targeting student learning, business documents, and travel use cases.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://mistral.ai/news/debugging-memory-leak-in-vllm">Heaps do lie: debugging a memory leak in vLLM.</a></strong></p><p>An engineering deep-dive documenting Mistral AI&#8217;s investigation of a 400 MB/minute memory leak in vLLM during disaggregated serving, ultimately traced to UCX&#8217;s mmap hooking mechanism interfering with Python&#8217;s memory allocator.</p><p><strong><a href="https://cloud.google.com/blog/products/networking/grpc-as-a-native-transport-for-mcp">gRPC as a custom transport for MCP</a></strong></p><p>A technical blog post explaining Google Cloud&#8217;s initiative to enable gRPC as a native transport for Model Context Protocol &#8211; eliminating transcoding overhead, enabling bidirectional streaming, and more.</p><p><strong><a href="https://www.llamaindex.ai/blog/files-are-all-you-need">Files Are All You Need</a></strong></p><p>A blog post arguing that files and filesystems are emerging as the core abstraction for agentic AI, with agents using ~5-10 tools (CLI, code interpreter, web fetch) operating on files proving more general than agents with 100+ MCP tools.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://medium.com/elementor-engineers/optimizing-token-usage-in-agent-based-assistants-ffd1822ece9c">Token Optimization Strategies for AI Agents</a></strong></p><p>A practical guide to reducing LLM token consumption in agentic systems by up to 75% through model selection, prompt caching, context optimization, and structured
outputs.</p><p><strong><a href="https://milvus.io/blog/llm-context-pruning-a-developers-guide-to-better-rag-and-agentic-ai-results.md">LLM Context Pruning: Improving RAG and Agentic AI Systems</a></strong></p><p>A technical guide explaining context pruning for RAG systems, introducing Provence as a lightweight cross-encoder that performs document-level reranking and sentence-level pruning simultaneously.</p><p><strong><a href="https://huggingface.co/blog/zilliz/zilliz-semantic-highlight-model">How We Built a Semantic Highlight Model To Save Token Cost for RAG</a></strong></p><p>A technical blog post detailing an open-source bilingual semantic highlight model that achieves 70-80% token cost reduction for RAG systems by identifying semantically relevant sentences.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/MemTensor/MemOS">MemTensor/MemOS</a></strong></p><p>MemOS is a Memory Operating System for LLMs and AI agents that unifies store / retrieve / manage for long-term memory, enabling context-aware and personalized interactions with KB, multi-modal, tool memory, and enterprise-grade optimizations built in.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2601.11514">ShapeR: Robust Conditional 3D Shape Generation from Casual Captures</a></strong></p><p><strong>Abstract:</strong></p><p>Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. 
We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.</p><p><strong><a href="https://arxiv.org/abs/2601.14192">Toward Efficient Agents: Memory, Tool learning, and Planning</a></strong></p><p><strong>Abstract:</strong></p><p>Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc. 
Aimed at conducting comprehensive research addressing the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 438]]></title><description><![CDATA[Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes, Claude Cowork, a paper on Prompt Repetition Improves Non-Reasoning LLMs, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-438</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-438</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 15 Jan 2026 16:02:53 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes</a>, <a href="https://claude.com/blog/cowork-research-preview">Claude Cowork</a> and <a href="https://arxiv.org/abs/2512.14982">a paper on Prompt Repetition Improves Non-Reasoning LLMs</a>.</p><p>You may also enjoy <a href="https://sakana.ai/ahc058/">Sakana AI Agent Wins AtCoder Heuristic Contest</a>, <a href="https://www.interconnects.ai/p/use-multiple-models">Use multiple models</a>, <a href="https://arxiv.org/abs/2601.07372">a paper on Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models</a> and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes</a></strong></p><p>A virtual hackathon that focuses on shipping LLM-powered apps that turn New Year&#8217;s resolutions into measurable outcomes across six impact categories.</p><p><strong><a href="https://claude.com/blog/cowork-research-preview">Introducing Cowork | Claude</a></strong></p><p>Anthropic launches Cowork research preview, extending Claude Code&#8217;s agentic capabilities to non-coding workflows for Claude Max subscribers on macOS.</p><p><strong><a href="https://sakana.ai/ahc058/">Sakana AI Agent Wins AtCoder Heuristic Contest (First AI to Place 1st)</a></strong></p><p>Sakana AI&#8217;s ALE-Agent became the first AI to win a competitive programming contest, defeating 804 human participants by discovering novel optimization algorithms.</p><p><strong><a href="https://siliconangle.com/2026/01/12/openai-buys-torch-bring-unified-medical-data-chatgpt-health/">OpenAI buys Torch to bring unified medical data into ChatGPT Health</a></strong></p><p>OpenAI acquires Torch to integrate unified medical data aggregation into ChatGPT Health, consolidating fragmented patient records from multiple healthcare providers into a single AI-powered interface.</p><p><strong><a href="https://blog.google/products/ads-commerce/agentic-commerce-ai-tools-protocol-retailers-platforms/">New tech and tools for retailers to succeed in an agentic shopping era</a></strong></p><p>Google launches Universal Commerce Protocol (UCP) with Shopify, Target, Walmart and 20+ partners, enabling AI-powered checkout in Search, branded Business Agent chatbots, and Direct Offers for personalized 
discounts.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://www.philschmid.de/building-agents-interactions-api">Building Agents with the Gemini Interactions API</a></strong></p><p>A practical guide about building AI agents using Google&#8217;s Gemini Interactions API (Beta), demonstrating how server-side state management simplifies agent development from basic chatbots to multi-turn CLI agents in under 100 lines of code.</p><p><strong><a href="https://cursor.com/blog/agent-best-practices">Best practices for coding with agents &#183; Cursor</a></strong></p><p>A comprehensive guide about maximizing productivity with Cursor&#8217;s AI coding agents through planning workflows, context management, parallel execution, and iterative debugging strategies.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://www.comet.com/site/blog/mipro-optimization/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=mipro-optimization/">MIPRO: The Optimizer That Brought Science to Prompt Engineering</a></strong></p><p>An article about MIPRO (Multiprompt Instruction Proposal Optimizer), achieving up to 13% better performance than hand-crafted prompts.</p><p><strong><a href="https://www.interconnects.ai/p/use-multiple-models">Use multiple models - by Nathan Lambert</a></strong></p><p>An article about the emerging multi-model workflow strategy for AI power users in 2026, where switching between GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro for different tasks yields better results than relying on any single model due to uneven &#8220;jagged&#8221; capabilities.</p><p><strong><a href="https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/">Next generation medical image interpretation with MedGemma 1.5 and medical speech to text with MedASR</a></strong></p><p>An article about Google&#8217;s MedGemma 1.5 4B update adding 3D medical imaging interpretation and 
MedASR speech transcription model, launching alongside a $100,000 Kaggle hackathon for healthcare AI applications.</p><p><strong><a href="https://pytorch.org/blog/supercharging-llms-scalable-rl-with-torchforge-and-weaver/">Supercharging LLMs: Scalable RL with torchforge and Weaver</a></strong></p><p>A technical post about Meta&#8217;s torchforge RL library achieving 4x faster training on 512 GPUs when combined with Stanford&#8217;s Weaver verifier system, capturing 44-65% of supervised learning performance without requiring human annotations.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/Aider-AI/aider">Aider-AI/aider</a></strong></p><p>AI pair programming in your terminal</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2601.07372">Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models</a></strong></p><p><strong>Abstract:</strong></p><p>While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. 
Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains~(HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone&#8217;s early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.</p><p><strong><a href="https://arxiv.org/abs/2512.14982">Prompt Repetition Improves Non-Reasoning LLMs</a></strong></p><p><strong>Abstract:</strong></p><p>When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 437]]></title><description><![CDATA[Comet, Vercel, and Google DeepMind launch month-long AI Agents hackathon with $30K prizes, 2025: The year in LLMs by Simon Willison, a paper on NitroGen: An Open Foundation Model for Generalist Gaming]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-437</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-437</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 08 Jan 2026 16:02:39 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes</a>, <a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/">2025: The year in LLMs by Simon Willison</a>, and <a href="https://arxiv.org/abs/2601.02427">a paper on NitroGen: An Open Foundation Model for Generalist Gaming Agents</a>.</p><p>You may also enjoy <a href="https://venturebeat.com/technology/tiis-falcon-h1r-7b-can-out-reason-models-up-to-7x-its-size-and-its-mostly">TII&#8217;s Falcon H1R 7B can out-reason models up to 7x its size</a>, <a href="https://www.philschmid.de/agent-harness-2026">The importance of Agent Harness in 2026</a>, <a 
href="https://openrouter.ai/state-of-ai">a paper on State of AI | An Empirical 100 Trillion Token Study with OpenRouter</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes</a></strong></p><p>Kicking off Jan 13, the virtual hackathon focuses on shipping LLM-powered apps that turn New Year&#8217;s resolutions into measurable outcomes across six impact categories.</p><p><strong><a href="https://venturebeat.com/orchestration/nvidias-cosmos-reason-2-aims-to-bring-reasoning-vlms-into-the-physical-world">Nvidia&#8217;s Cosmos Reason 2 aims to bring reasoning VLMs into the physical world</a></strong></p><p>NVIDIA released Cosmos Reason 2, an open-source reasoning vision-language model that enables robots and AI agents to understand and navigate the physical world.</p><p><strong><a href="https://venturebeat.com/technology/tiis-falcon-h1r-7b-can-out-reason-models-up-to-7x-its-size-and-its-mostly">TII&#8217;s Falcon H1R 7B can out-reason models up to 7x its size</a></strong></p><p>Technology Innovation Institute launched Falcon H1R 7B, a reasoning model using a hybrid Transformer-Mamba architecture that matches or outperforms competitors 2-7x its size through architectural efficiency and test-time scaling.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/multi-agent-systems/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=multi-agent-systems//">Multi-Agent Systems: The Architecture Shift from Monolithic LLMs to Collaborative Intelligence</a></strong></p><p>An architecture guide explaining the 
evolution from monolithic LLM prompts to multi-agent systems, covering architectural philosophies, cognitive patterns, and production challenges.</p><p><strong><a href="https://www.philschmid.de/agent-harness-2026">The importance of Agent Harness in 2026</a></strong></p><p>A technical blog post arguing that Agent Harnesses&#8212;the infrastructure layer managing long-running AI tasks&#8212;will become critical in 2026 as model differentiation shifts from benchmark performance to durability over hundreds of tool calls.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/">2025: The year in LLMs by Simon Willison</a></strong></p><p>A comprehensive year-in-review blog post covering 24 major LLM trends in 2025, including reasoning models&#8217; emergence, coding agents&#8217; $1B revenue milestone, Chinese models dominating open-weight rankings, and more.</p><p><strong><a href="https://venturebeat.com/technology/the-creator-of-claude-code-just-revealed-his-workflow-and-developers-are">The creator of Claude Code just revealed his workflow, and developers are losing their minds</a></strong></p><p>An article covering Claude Code creator&#8217;s development workflow, demonstrating how parallel AI agents, verification loops, and more enable a single developer to achieve output comparable to an entire engineering team.</p><p><strong><a href="https://huggingface.co/blog/nvidia/llama-nemotron-vl-1b">Small Yet Mighty: Improve Accuracy In Multimodal Search and Visual Document Retrieval with Llama Nemotron RAG Models</a></strong></p><p>A technical blog post introducing NVIDIA&#8217;s Llama Nemotron VL 1B models&#8212;a 1.7B parameter multimodal embedding and reranking system.</p><p><strong><a href="https://netflixtechblog.medium.com/towards-generalizable-and-efficient-large-scale-generative-recommenders-a7db648aa257">Towards Generalizable and Efficient Large-Scale Generative Recommenders</a></strong></p><p>An article 
detailing Netflix&#8217;s approach to scaling generative recommendation models from 50M to 1B parameters, achieving substantial improvements through novel scaling laws, efficiency optimizations, and alignment strategies that address the unique challenges of recommendation systems.</p><p><strong><a href="https://magazine.sebastianraschka.com/p/state-of-llms-2025">The State Of LLMs 2025: Progress, Problems, and Predictions by Sebastian Raschka</a></strong></p><p>A comprehensive technical review analyzing 2025 LLM developments through the lens of training methodologies, architectural evolution, and practical applications.</p><p><strong><a href="https://cloud.google.com/blog/topics/developers-practitioners/why-stochastic-rounding-is-essential-for-modern-generative-ai">Why Stochastic Rounding is Essential for Modern Generative AI</a></strong></p><p>A technical blog post explaining how stochastic rounding solves vanishing gradient problems in low-precision AI training, enabling models to train effectively in FP8 and 4-bit formats.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/neuml/txtai">neuml/txtai</a></strong></p><p>All-in-one AI framework for semantic search, LLM orchestration and language model workflows</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://openrouter.ai/state-of-ai">State of AI | An Empirical 100 Trillion Token Study with OpenRouter</a></strong></p><p><strong>Abstract:</strong></p><p>The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). 
With the release of the first widely adopted reasoning model, o1, on December 5th, 2024, the field shifted from single-pass pattern generation to multi-step deliberation inference, accelerating deployment, experimentation, and new classes of applications. As this shift unfolded at a rapid pace, our empirical understanding of how these models have actually been used in practice has lagged behind. In this work, we leverage the OpenRouter platform, which is an AI inference provider across a wide variety of LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time. In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of creative roleplay (beyond just the productivity tasks many assume dominate) and coding assistance categories, plus the rise of agentic inference. Furthermore, our retention analysis identifies foundational cohorts: early users whose engagement persists far longer than later cohorts. We term this phenomenon the Cinderella &#8220;Glass Slipper&#8221; effect. These findings underscore that the way developers and end-users engage with LLMs &#8220;in the wild&#8221; is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.</p><p><strong><a href="https://arxiv.org/abs/2601.02427">NitroGen: An Open Foundation Model for Generalist Gaming Agents</a></strong></p><p><strong>Abstract:</strong></p><p>We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. 
We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.</p><p><strong><a href="https://arxiv.org/abs/2512.24601">Recursive Language Models</a></strong></p><p><strong>Abstract:</strong></p><p>We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 436]]></title><description><![CDATA[GPT-5.2-Codex, Prompt Drift: The Hidden Failure Mode Undermining Agentic Systems, a paper on Adaptation of Agentic AI, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-436</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-436</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 01 Jan 2026 16:25:58 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://openai.com/index/introducing-gpt-5-2-codex/">GPT-5.2-Codex</a>, <a href="https://www.comet.com/site/blog/prompt-drift/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=prompt-drift/">Prompt Drift: The Hidden Failure Mode Undermining Agentic Systems</a>, and <a href="https://arxiv.org/abs/2512.16301">a paper on Adaptation of Agentic AI</a>.</p><p>You may also enjoy <a href="https://mistral.ai/news/mistral-ocr-3">Mistral OCR 3</a>, <a href="https://karpathy.bearblog.dev/year-in-review-2025/">2025 LLM Year in Review | karpathy</a>, <a href="https://arxiv.org/abs/2511.18538">a paper on: From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence</a>, and more!</p><p>As always, happy reading and 
hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://openai.com/index/introducing-gpt-5-2-codex/">Introducing GPT-5.2-Codex | OpenAI</a></strong></p><p>An introduction to GPT-5.2-Codex, OpenAI&#8217;s most advanced agentic coding model optimized for complex software engineering and defensive cybersecurity.</p><p><strong><a href="https://mistral.ai/news/mistral-ocr-3">Introducing Mistral OCR 3</a></strong></p><p>An introduction to Mistral OCR 3, achieving a 74% win rate over its predecessor with state-of-the-art accuracy on forms, handwriting, and complex tables.</p><p><strong><a href="https://runwayml.com/research/introducing-runway-gwm-1">Introducing Runway GWM-1</a></strong></p><p>Runway announces a real-time General World Model family with three variants for explorable environments, interactive characters, and robotic manipulation.</p><p><strong><a href="https://siliconangle.com/2025/12/29/meta-platforms-buys-manus-bolster-agentic-ai-skillset/">Meta Platforms buys Manus to bolster its agentic AI skillset</a></strong></p><p>Meta acquires Singapore-based Manus, a general-purpose AI agent that reached $100M ARR in just eight months.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/prompt-drift/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=prompt-drift/">Prompt Drift: The Hidden Failure Mode Undermining Agentic Systems</a></strong></p><p>A blog post explaining prompt drift and how it undermines multi-step agentic systems through subtle reasoning degradation rather than clean failures.</p><p><strong><a href="https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools">We removed 80% of our agent&#8217;s tools</a></strong></p><p>A case study about how Vercel simplified their 
internal text-to-SQL agent (d0) by removing 80% of specialized tools and replacing them with a single bash command execution tool.</p><p><strong><a href="https://medium.com/mongodb/agents-meet-databases-the-future-of-agentic-architectures-b24cdacada43">Agents Meet Databases: The Future of Agentic Architectures</a></strong></p><p>A MongoDB article exploring two architectural paths for connecting AI agents to databases: standardized MCP servers versus custom LangChain integrations, with emphasis on accuracy, security, and performance trade-offs.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://karpathy.bearblog.dev/year-in-review-2025/">2025 LLM Year in Review | karpathy</a></strong></p><p>A technical retrospective by Andrej Karpathy identifying six paradigm shifts in LLMs during 2025, including the rise of reinforcement learning from verifiable rewards, the emergence of &#8220;vibe coding,&#8221; and new AI interaction paradigms like Claude Code.</p><p><strong><a href="https://blog.redwoodresearch.org/p/measuring-no-cot-math-time-horizon">Measuring no CoT math time horizon (single forward pass)</a></strong></p><p>A research article measuring AI models&#8217; ability to solve math problems without chain-of-thought reasoning.</p><p><strong><a href="https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/">Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate</a></strong></p><p>A technical deep dive introducing NVIDIA Nemotron 3&#8217;s hybrid Mamba-Transformer MoE architecture with native 1M-token context and multi-environment RL training.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and 
production-ready dashboards.</p><p><strong><a href="https://github.com/github/spec-kit">github/spec-kit</a></strong></p><p>An open source toolkit that allows you to focus on product scenarios and predictable outcomes instead of vibe coding every piece from scratch.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2511.18538">From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence</a></strong></p><p><strong>Abstract:</strong></p><p>Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like Github Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. 
Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.</p><p><strong><a href="https://arxiv.org/abs/2512.16301">Adaptation of Agentic AI</a></strong></p><p><strong>Abstract:</strong></p><p>Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. 
Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 435]]></title><description><![CDATA[Announcing the Future of AI Engineering: Self-Optimizing Agents, Fantastic Bugs and Where to Find Them in AI Benchmarks, Motif-2-12.7B-Reasoning: A Guide to RL Training Recipes, and more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-435</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-435</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 18 Dec 2025 16:02:52 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/self-optimizing-agents/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=self-optimizing-agents/">Announcing the Future of AI 
Engineering: Self-Optimizing Agents</a>, <a href="https://ai.stanford.edu/blog/fantastic-bugs/">Fantastic Bugs and Where to Find Them in AI Benchmarks</a>, and <a href="https://arxiv.org/abs/2512.11463">a paper on Motif-2-12.7B-Reasoning: A Practitioner&#8217;s Guide to RL Training Recipes</a>.</p><p>You may also enjoy <a href="https://ai.meta.com/blog/sam-audio/">Introducing SAM Audio: The First Unified Multimodal Model for Audio Separation</a>, <a href="https://www.letta.com/blog/letta-code">Letta Code: A Memory-First Coding Agent</a>, <a href="https://openai.com/index/frontierscience/">a paper on Evaluating AI&#8217;s ability to perform scientific research tasks</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/products/gemini/gemini-3-flash/">Gemini 3 Flash: frontier intelligence built for speed</a></strong></p><p>Google announced Gemini 3 Flash, achieving 90.4% on GPQA Diamond and 78% on SWE-bench Verified while being 3x faster than Gemini 2.5 Pro at $0.50 per million input tokens.</p><p><strong><a href="https://ai.meta.com/blog/sam-audio/">Introducing SAM Audio: The First Unified Multimodal Model for Audio Separation</a></strong></p><p>Meta announces SAM Audio, the first unified multimodal model enabling intuitive audio separation through text, visual, or temporal prompts, achieving state-of-the-art performance.</p><p><strong><a href="https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models">NVIDIA Debuts Nemotron 3 Family of Open Models</a></strong></p><p>NVIDIA launches the Nemotron-3 family of open-source AI models, offering developers new tools for building and deploying customizable language models across various applications.</p><p><strong><a 
href="https://blog.google/technology/developers/interactions-api/">Interactions API: A unified foundation for models and agents</a></strong></p><p>Google launches Interactions API, featuring server-side state management, background execution, and access to the Gemini Deep Research agent.</p><p><strong><a href="https://opensearch.org/blog/introducing-opensearch-3-4/">Introducing OpenSearch 3.4</a></strong></p><p>OpenSearch announces its version 3.4 release, introducing new features and improvements to the open-source search and analytics suite for enhanced performance, security, and developer experience.</p><p><strong><a href="https://www.csail.mit.edu/news/new-method-enables-small-language-models-solve-complex-reasoning-tasks">New method enables small language models to solve complex reasoning tasks</a></strong></p><p>MIT CSAIL researchers develop a training method that enables small language models to perform complex reasoning tasks by learning to generate internal &#8220;thought&#8221; processes, achieving comparable results to much larger models.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/self-optimizing-agents/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=self-optimizing-agents/">Announcing the Future of AI Engineering: Self-Optimizing Agents</a></strong></p><p>A blog post exploring how self-optimizing agents use continuous evaluation and feedback loops to automatically improve prompts, tools, and behaviors over time, moving beyond static agent design toward systems that learn and adapt in production.</p><p><strong><a href="https://www.letta.com/blog/letta-code">Letta Code: A Memory-First Coding Agent</a></strong></p><p>A blog post introducing Letta Code, a memory-first coding agent that ranks #1 among model-agnostic open-source harnesses on TerminalBench.</p><p><strong><a 
href="https://milvus.io/blog/introducing-aisaq-in-milvus-billion-scale-vector-search-got-3200-cheaper-on-memory.md">AISAQ in Milvus: Billion-Scale Vector Search Just Got 3,200&#215; Cheaper on Memory</a></strong></p><p>A technical article introducing AISAQ, a disk-based vector index achieving 3,200&#215; memory reduction (32 GB to 10 MB) for billion-scale vector search by storing all data on SSD with optimized layouts.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://ai.stanford.edu/blog/fantastic-bugs/">Fantastic Bugs and Where to Find Them in AI Benchmarks</a></strong></p><p>An article introducing a measurement-theoretic framework that identifies flawed questions in AI benchmarks with up to 84% precision, detecting issues across nine widely used datasets.</p><p><strong><a href="https://developer.nvidia.com/blog/how-to-build-privacy-preserving-evaluation-benchmarks-with-synthetic-data/">How to Build Privacy-Preserving Evaluation Benchmarks with Synthetic Data</a></strong></p><p>A technical tutorial demonstrating how to build privacy-preserving AI evaluation benchmarks using NVIDIA NeMo Data Designer and NeMo Evaluator to generate synthetic datasets.</p><p><strong><a href="https://blogs.nvidia.com/blog/rtx-ai-garage-fine-tuning-unsloth-dgx-spark/">How to Fine-Tune an LLM on NVIDIA GPUs With Unsloth</a></strong></p><p>A guide about fine-tuning LLMs using Unsloth on NVIDIA DGX Cloud and Spark, demonstrating how to customize AI models for specific tasks with improved performance and efficiency.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/andrewyng/aisuite">andrewyng/aisuite</a></strong></p><p>Simple, unified 
interface to multiple Generative AI providers</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2512.11463">Motif-2-12.7B-Reasoning: A Practitioner&#8217;s Guide to RL Training Recipes</a></strong></p><p><strong>Abstract:</strong></p><p>We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.</p><p><strong><a href="https://openai.com/index/frontierscience/">Evaluating AI&#8217;s ability to perform scientific research tasks</a></strong></p><p><strong>Abstract:</strong></p><p>We introduce FrontierScience, a benchmark evaluating AI capabilities for expert-level scientific reasoning. 
FrontierScience consists of two tracks: (1) Olympiad, which contains international olympiad problems (at the level of IPhO, IChO, and IBO), and (2) Research, which contains PhD-level, open-ended problems representative of sub-problems in scientific research. In total, FrontierScience is composed of several hundred questions (160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already published information. In contrast, all Olympiad problems are originally produced by international olympiad medalists and national team coaches to ensure standards of difficulty, originality, and factuality. All Research problems are research sub-tasks written and verified by PhD scientists (doctoral candidates, postdoctoral researchers, or professors). For Research, we also introduce a granular rubric-based architecture to evaluate model capabilities throughout the process of solving a research task, as opposed to judging a standalone answer. In initial evaluations of several frontier models, GPT-5.2 is the top performing model on FrontierScience, scoring 77% on the Olympiad set and 25% on the Research set.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 434]]></title><description><![CDATA[Devstral 2 and Mistral Vibe CLI, AI Agent Orchestration Flows, a paper on Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-434</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-434</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 11 Dec 2025 16:00:36 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://mistral.ai/news/devstral-2-vibe-cli">Introducing: Devstral 2 and Mistral Vibe CLI</a>, <a href="https://www.comet.com/site/blog/agent-orchestration/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=agent-orchestration/">AI Agent Orchestration Flows</a>, and <a href="https://arxiv.org/abs/2511.22699">a paper on Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer</a>.</p><p>You may also enjoy <a href="https://cloud.google.com/blog/products/ai-machine-learning/mcp-support-for-apigee">MCP support for Apigee</a>, <a href="https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/">Claude Agent Skills: A First Principles Deep Dive</a>, <a 
href="https://arxiv.org/abs/2512.07921">a paper on DeepCode: Open Agentic Coding</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://mistral.ai/news/devstral-2-vibe-cli">Introducing: Devstral 2 and Mistral Vibe CLI</a></strong></p><p>Mistral released Devstral 2, a state-of-the-art open-source coding model achieving 72.2% on SWE-bench Verified, alongside Mistral Vibe CLI.</p><p><strong><a href="https://cloud.google.com/blog/products/ai-machine-learning/mcp-support-for-apigee">MCP support for Apigee</a></strong></p><p>Google Cloud announces Model Context Protocol (MCP) support in Apigee, allowing developers to turn existing APIs into secure, governed agentic tools without code changes or managing MCP servers.</p><p><strong><a href="https://techcrunch.com/2025/12/08/claude-code-is-coming-to-slack-and-thats-a-bigger-deal-than-it-sounds/">Claude Code is coming to Slack, and that&#8217;s a bigger deal than it sounds</a></strong></p><p>Anthropic launches Claude Code in Slack beta, letting developers delegate complete coding workflows directly from chat threads.</p><p><strong><a href="https://openai.com/index/openai-to-acquire-neptune/">OpenAI to acquire Neptune</a></strong></p><p>OpenAI has entered into a definitive agreement to acquire neptune.ai, strengthening the tools and infrastructure that support progress in frontier research.</p><p><strong><a href="https://siliconangle.com/2025/12/09/multimodal-ai-provider-fal-nabs-140m-amid-rapid-growth/">Multimodal AI provider fal nabs $140M amid rapid growth</a></strong></p><p>Multimodal AI startup fal raised a $140 million series D led by Sequoia, growing revenue by 300% since July with 600+ AI models for image, audio, and video generation.</p><p><strong><a 
href="https://techcrunch.com/2025/12/10/oboe-raises-16-million-from-a16z-for-its-ai-powered-course-generation-platform/">Oboe raises $16 million from a16z for its AI-powered course generation platform</a></strong></p><p>Oboe, a learning startup from Anchor co-founders and former Spotify execs Nir Zicherman and Michael Mignano, has raised $16 million in Series A funding led by a16z.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/agent-orchestration/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=agent-orchestration/">AI Agent Orchestration Flows</a></strong></p><p>An explanatory post defining agent orchestration as the architectural layer that manages non-deterministic control flow and the iterative Thought-Action-Observation cycle.</p><p><strong><a href="https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/">Top 5 AI Model Optimization Techniques for Faster, Smarter Inference</a></strong></p><p>A technical blog post detailing the top five AI model optimization techniques to improve inference speed, TCO, and scalability on NVIDIA GPUs.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/">Claude Agent Skills: A First Principles Deep Dive</a></strong></p><p>An article analyzing Claude&#8217;s Agent Skills system as a prompt-based meta-tool architecture that modifies the conversation and execution contexts by injecting hidden instructions and changing tool permissions.</p><p><strong><a href="https://www.growthunhinged.com/p/the-ai-churn-wave">The AI churn wave?</a></strong></p><p>A post investigating the low gross and net revenue retention rates among AI-native companies, identifying an &#8220;AI tourist problem&#8221; especially pronounced in low-priced products that see GRR as low as 23%.</p><p><strong><a 
href="https://magazine.sebastianraschka.com/p/technical-deepseek">A Technical Tour of the DeepSeek Models from V3 to V3.2</a></strong></p><p>A technical article detailing the architectural evolution of DeepSeek&#8217;s flagship models from V3 to V3.2, focusing on the efficiency gains from DeepSeek Sparse Attention and the implementation of self-verification for improved reasoning capabilities.</p><p><strong><a href="https://blog.ml.cmu.edu/2025/12/09/validating-llm-as-a-judge-systems-under-rating-indeterminacy/">Validating LLM-as-a-Judge Systems under Rating Indeterminacy</a></strong></p><p>An article about validating LLM judges under rating indeterminacy, proposing a framework that uses response set elicitation and multi-label agreement metrics to better select judges for evaluation tasks when multiple interpretations are valid.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/block/goose">block/goose</a></strong></p><p>A local, extensible, open-source AI agent that automates engineering tasks.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2511.22699">Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer</a></strong></p><p><strong>Abstract:</strong></p><p>The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. 
To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the &#8220;scale-at-all-costs&#8221; paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (&lt;16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.</p><p><strong><a href="https://arxiv.org/abs/2512.07921">DeepCode: Open Agentic Coding</a></strong></p><p><strong>Abstract:</strong></p><p>Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. 
However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis--such as scientific papers to code--primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that fundamentally addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode seamlessly orchestrates four information operations to maximize task-relevant signals under finite context budgets: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents such as Cursor and Claude Code, and crucially, surpassing PhD-level human experts from top institutes on key reproduction metrics.</p><p>By systematically transforming paper specifications into production-grade implementations comparable to human expert quality, this work establishes new foundations for autonomous scientific reproduction that can accelerate research evaluation and discovery.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 433]]></title><description><![CDATA[Mistral 3, MCP Explorer, a paper on Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-433</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-433</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 04 Dec 2025 16:03:08 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://mistral.ai/news/mistral-3">Introducing Mistral 3</a>, <a href="https://neurips-mcp-presentation.vercel.app/">MCP Explorer</a>, and <a href="https://openreview.net/pdf?id=Rv664iOMNv">a paper on Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks</a>.</p><p>You may also enjoy <a href="https://bfl.ai/blog/our-300m-series-b">Laying the Foundations for Visual Intelligence</a>, <a href="https://posthog.com/blog/8-learnings-from-1-year-of-agents-posthog-ai">8 learnings from 1 year of agents</a>, <a href="https://arxiv.org/abs/2506.23046">a paper on SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://mistral.ai/news/mistral-3">Introducing Mistral 3</a></strong></p><p>The Mistral team announced Mistral 3, which includes three state-of-the-art small models and Mistral Large 3 &#8212; a sparse mixture-of-experts model.</p><p><strong><a href="https://bfl.ai/blog/our-300m-series-b">Laying the Foundations for Visual Intelligence&#8212;$300M Series B | Black Forest Labs</a></strong></p><p>Black Forest Labs raises $300M Series B at $3.25B valuation to advance visual intelligence models beyond its popular FLUX image generation platform.</p><p><strong><a href="https://venturebeat.com/ai/new-training-method-boosts-ai-multimodal-reasoning-with-smaller-smarter">New training method boosts AI multimodal reasoning with smaller, smarter datasets</a></strong></p><p>Researchers at MiroMind AI and several Chinese universities have released OpenMMReasoner, a training framework that improves the capabilities of models in multimodal reasoning.</p><p><strong><a href="https://www.letta.com/blog/skill-learning">Skill Learning: Bringing Continual Learning to CLI Agents</a></strong></p><p>The Letta team released Skill Learning, a way for Letta Code to dynamically learn skills over time.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://neurips-mcp-presentation.vercel.app/">MCP Explorer</a></strong></p><p>An educational project for learning Anthropic&#8217;s Model Context Protocol through a narrative-driven and interactive learning experience.</p><p><strong><a href="https://posthog.com/blog/8-learnings-from-1-year-of-agents-posthog-ai">8 learnings from 1 year of agents</a></strong></p><p>A detailed retrospective blog post sharing 8 key learnings from a year of developing PostHog AI, focusing on architectural choices 
like using a single LLM loop and the power of continuous model improvements.</p><p><strong><a href="https://opensearch.org/blog/opensearch-as-an-agentic-memory-solution-building-context-aware-agents-using-persistent-memory/">OpenSearch as an agentic memory solution: Building context-aware agents using persistent memory</a></strong></p><p>A blog post that explores the memory challenges facing AI agents, introduces agentic memory&#8217;s core concepts, and demonstrates how to integrate it with your agent frameworks.</p><p><strong><a href="https://developer.nvidia.com/blog/build-and-run-secure-data-driven-ai-agents/">Build and Run Secure, Data-Driven AI Agents</a></strong></p><p>A technical guide detailing the deployment of NVIDIA&#8217;s AI-Q Research Assistant and Enterprise RAG Blueprints, which use Nemotron NIMs and an agentic Plan-Refine-Reflect workflow on Amazon EKS.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://alignment.anthropic.com/2025/honesty-elicitation/">Evaluating honesty and lie detection techniques on a diverse suite of dishonest models</a></strong></p><p>An alignment report evaluating techniques like fine-tuning and prompting to improve AI honesty and detect lies across five specialized testbed models.</p><p><strong><a href="https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks">A Rosetta Stone for AI benchmarks</a></strong></p><p>A statistical paper introducing a &#8220;Rosetta Stone&#8221; framework that stitches together around 40 different AI benchmarks to rigorously measure long-run capability trends and forecast future algorithmic progress.</p><p><strong><a href="https://developers.openai.com/codex/guides/agents-md/">Custom instructions with AGENTS.md</a></strong></p><p>A guide that shows you how to understand how Codex discovers persistent guidance, author global and per-project instruction files, and verify that Codex honors your setup during real CLI runs.</p><p><strong><a 
href="https://ai.stanford.edu/blog/feedback-descent/">Following the Text Gradient at Scale</a></strong></p><p>A blog post introducing Feedback Descent, a learning paradigm that uses rich, textual feedback instead of scalar rewards to guide iterative improvements in domains like molecular design and prompt optimization.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://openreview.net/pdf?id=Rv664iOMNv">Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks</a></strong></p><p><strong>Abstract:</strong></p><p>While agentic AI systems perform impressively on emerging capability benchmarks, existing performance evaluation suites focus on non-agentic workloads, leaving a critical gap in understanding system efficiency for multi-step, tool-using agents. We present the Agentic Bridge Framework for extracting actionable performance insights from capability evaluations through trace-level telemetry. Applying this framework to a multi-agent system on GAIA validation, we reveal that: (1) pass@N strategies provide diminishing accuracy returns; (2) search agents dominate token usage and latency, identifying web data gathering as the primary bottleneck; (3) reasoning models spend more tokens on context preservation than actual reasoning, highlighting costly inter-agent communication overhead. 
These findings inform critical design choices&#8212;context engineering, tool-use optimization, and phase-aware resource allocation&#8212;and illustrate how agent traces can inform reproducible performance workloads, bridging capability achievements with systems optimization for efficient agentic AI.</p><p><strong><a href="https://arxiv.org/abs/2506.23046">SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions</a></strong></p><p><strong>Abstract:</strong></p><p>Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model&#8217;s ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). 
The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.</p>]]></content:encoded></item></channel></rss>