<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Deep Learning Weekly]]></title><description><![CDATA[Bringing you everything new and exciting in the world of  deep learning from academia to the grubby depths  of industry every week right to your inbox.]]></description><link>https://www.deeplearningweekly.com</link><image><url>https://substackcdn.com/image/fetch/$s_!yiM2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63609b6-c5bb-426a-a5c1-b6ce9d56b51e_468x468.png</url><title>Deep Learning Weekly</title><link>https://www.deeplearningweekly.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 14 Jun 2026 11:14:47 GMT</lastBuildDate><atom:link href="https://www.deeplearningweekly.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Deep Learning Weekly]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[deeplearningweekly@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[deeplearningweekly@substack.com]]></itunes:email><itunes:name><![CDATA[Deep Learning Weekly]]></itunes:name></itunes:owner><itunes:author><![CDATA[Deep Learning Weekly]]></itunes:author><googleplay:owner><![CDATA[deeplearningweekly@substack.com]]></googleplay:owner><googleplay:email><![CDATA[deeplearningweekly@substack.com]]></googleplay:email><googleplay:author><![CDATA[Deep Learning Weekly]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Deep Learning Weekly: Issue 459]]></title><description><![CDATA[Claude Fable 5, Cohere&#8217;s North Mini Code, a paper on Tokenomics: Quantifying Where Tokens Are 
Used in Agentic Software Engineering, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-459</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-459</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Fri, 12 Jun 2026 15:00:51 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">Claude Fable 5</a>, <a href="https://huggingface.co/blog/CohereLabs/introducing-north-mini-code">North Mini Code: Cohere&#8217;s First Model For Developers</a> and <a href="https://arxiv.org/abs/2601.14470">a paper on Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering</a>.</p><p>You may also enjoy <a href="https://research.nvidia.com/labs/nemotron/Nemotron-3-Ultra/">NVIDIA Nemotron 3 Ultra</a>, <a href="https://epoch.ai/gradient-updates/controlling-the-capital-after-agi">Controlling the capital after AGI</a>, <a href="https://arxiv.org/abs/2602.07055">a paper on Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">Claude Fable 5 and Claude Mythos 5 \ Anthropic</a></strong></p><p>Anthropic launches Claude Fable 5, a general-access Mythos-class model at $10/$50 per M tokens, with classifier-based fallbacks to Opus 4.8 for cyber, bio, and distillation queries.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/">Introducing Gemma 4 12B: a unified, encoder-free multimodal model</a></strong></p><p>Google releases Gemma 4 12B, an encoder-free multimodal model that runs on 16GB VRAM and processes vision and audio natively through the LLM backbone.</p><p><strong><a href="https://research.nvidia.com/labs/nemotron/Nemotron-3-Ultra/">NVIDIA Nemotron 3 Ultra</a></strong></p><p>NVIDIA releases Nemotron 3 Ultra &#8212; 550B total / 55B active MoE hybrid Mamba-Transformer, pretrained in NVFP4, with up to 5.9x throughput over competing open MoEs and 1M token context.</p><p><strong><a href="https://openai.com/index/openai-submits-confidential-s-1/">Confidential submission of draft S-1 to the SEC | OpenAI</a></strong></p><p>OpenAI files a confidential S-1 with the SEC, preemptively announcing it publicly ahead of an expected leak &#8212; while noting IPO timing remains undecided as some strategic moves are easier as a private company.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://cloud.google.com/blog/products/devops-sre/how-google-sre-is-using-agentic-ai-to-improve-operations">How Google SRE is using agentic AI to improve operations</a></strong></p><p>An article about how Google SRE is wiring agentic AI across the full incident lifecycle &#8212; dynamic anomaly detection, autonomous investigation, and a RAG 
layer over historical incidents to inform mitigation agents.</p><p><strong><a href="https://research.google/blog/unlocking-dependable-responses-with-gemini-enterprise-agent-platforms-agentic-rag/">Unlocking dependable responses with Gemini Enterprise Agent Platform&#8217;s Agentic RAG</a></strong></p><p>Google launches an agentic RAG framework on Gemini Enterprise Agent Platform with a Sufficient Context Agent that iterates until retrieval gaps are filled, hitting 90.1% accuracy on multi-hop queries &#8212; up to 34% over vanilla RAG.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://huggingface.co/blog/CohereLabs/introducing-north-mini-code">North Mini Code: Cohere&#8217;s First Model For Developers</a></strong></p><p>An article about how Cohere trained North Mini Code&#8217;s agentic coding capabilities &#8212; using cascaded SFT as an RLVR primer, joint multi-environment RL across 70k containerized repos, and cross-harness data mixing to generalize across SWE-Agent, mini-SWE-agent, and OpenCode scaffolds.</p><p><strong><a href="https://deepmindsafetyresearch.medium.com/testing-gemini-models-for-scheming-tendencies-3368c013ff16">Testing Gemini models for scheming tendencies | by DeepMind Safety Research</a></strong></p><p>DeepMind releases two scheming eval frameworks for Gemini &#8212; Gram (simulated agentic environments) and honeypots (real safety codebases) &#8212; finding 2&#8211;3% unprompted sabotage rates and no coherent misalignment.</p><p><strong><a href="https://www.interconnects.ai/p/claude-fable-5-and-new-ai-safety">Claude Fable 5 and new AI safety fables</a></strong></p><p>Nathan Lambert argues that Anthropic&#8217;s undisclosed safety filters in Claude Fable 5 &#8212; which silently degrade responses for frontier AI research without notifying users &#8212; are competitive entrenchment dressed as safety policy.</p><p><strong><a href="https://epoch.ai/gradient-updates/controlling-the-capital-after-agi">Controlling the capital after 
AGI</a></strong></p><p>An analytical piece from Epoch AI taxonomizing post-AGI wealth redistribution proposals &#8212; UBI, UBS, UBC, and sovereign wealth funds &#8212; along a single axis: how much control over capital, not just income, each scheme grants citizens.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2601.14470">Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering</a></strong></p><p><strong>Abstract:</strong></p><p>LLM-based Multi-Agent (LLM-MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, hindering practical adoption due to unpredictable costs and environmental impact. To address this, we conduct an analysis of token consumption patterns in an LLM-MA system within the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare token distribution (input, output, reasoning) across these stages.</p><p>Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption for an average of 59.4% of tokens. 
Furthermore, we observe that input tokens consistently constitute the largest share of consumption for an average of 53.9%, providing empirical evidence for potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our novel methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token-efficient agent collaboration protocols.</p><p><strong><a href="https://arxiv.org/abs/2602.07055">Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?</a></strong></p><p><strong>Abstract:</strong></p><p>Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent&#8217;s ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. 
Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 458]]></title><description><![CDATA[Claude Opus 4.8, Agent Tracing and Observability: Log & Debug Complex AI Systems, a paper on A Self-Healing Framework for Reliable LLM-Based Autonomous Agents, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-458</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-458</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 04 Jun 2026 15:01:49 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a 
href="https://www.anthropic.com/news/claude-opus-4-8">Claude Opus 4.8</a>, <a href="https://www.comet.com/site/blog/ai-agent-tracing/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-agent-tracing/">Agent Tracing and Observability: Log &amp; Debug Complex AI Systems</a> and <a href="https://arxiv.org/abs/2605.06737">a paper on A Self-Healing Framework for Reliable LLM-Based Autonomous Agents</a>.</p><p>You may also enjoy <a href="https://www.minimax.io/blog/minimax-m3">Minimax M3</a>, <a href="https://huggingface.co/blog/Dharma-AI/direct-preference-optimization-beyond-chatbots">Direct Preference Optimization Beyond Chatbots</a>, <a href="https://arxiv.org/abs/2505.17117">a paper on From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-8">Introducing Claude Opus 4.8 \ Anthropic</a></strong></p><p>Anthropic ships Claude Opus 4.8 at the same price as 4.7 &#8212; adding benchmark gains across coding and agentic tasks, a huge reduction in unremarked code flaws, and cheaper fast mode.</p><p><strong><a href="https://openai.com/index/codex-for-every-role-tool-workflow/">Codex for every role, tool, and workflow</a></strong></p><p>OpenAI expands Codex beyond developers with six role-specific plugins covering 62 apps and 110 skills, a preview of shareable hosted Sites, and inline annotations.</p><p><strong><a href="https://mistral.ai/news/search-toolkit/">Introducing Search Toolkit</a></strong></p><p>Mistral launches Search Toolkit, an open-source composable framework unifying ingestion, retrieval, and evaluation into a single production-ready pipeline for enterprise RAG and search 
applications.</p><p><strong><a href="https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws/">OpenAI frontier models and Codex are now available on AWS</a></strong></p><p>OpenAI makes its frontier models and Codex generally available on AWS via Amazon Bedrock &#8212; including GovCloud regions &#8212; letting enterprises adopt OpenAI through existing AWS security, compliance, and procurement workflows.</p><p><strong><a href="https://www.minimax.io/blog/minimax-m3">MiniMax M3: Frontier Coding, 1M Context, Native Multimodality &#8212; All in One Model</a></strong></p><p>MiniMax launches M3, currently the only open-weight model combining frontier coding performance, native multimodality, and a 1M-token context window via a new sparse attention architecture.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/ai-agent-tracing/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-agent-tracing/">Agent Tracing and Observability: Log &amp; Debug Complex AI Systems</a></strong></p><p>A guide on instrumenting agent tracing for multi-agent systems, covering why flat logging breaks at coordination boundaries, the three structural pillars of agentic observability, and how self-evolving agents compound debugging complexity.</p><p><strong><a href="https://cursor.com/blog/cloud-agent-lessons">What we&#8217;ve learned building cloud agents</a></strong></p><p>A Cursor engineering retrospective on a year of shipping cloud agents, arguing the work is less &#8220;local agent on a server&#8221; and more building a full operating layer &#8212; covering environment fidelity, durable execution via Temporal, etc.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://epoch.ai/data-insights/open-closed-eci-gap">Open models lag state-of-the-art closed models by 4 months</a></strong></p><p>An Epoch AI data insight measuring the open-to-closed model capability gap using their 
Epoch Capabilities Index (ECI), finding open-weight models now lag frontier closed models by an average of 4 months.</p><p><strong><a href="https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack">What we learned mapping a year&#8217;s worth of AI-enabled cyber threats \ Anthropic</a></strong></p><p>Anthropic analyzed 832 banned malicious accounts over one year, finding that AI is accelerating cyberattack sophistication &#8212; shifting from initial access tactics to post-compromise operations.</p><p><strong><a href="https://zilliz.com/blog/what-is-a-vector-lakebase">What Is a Vector Lakebase?</a></strong></p><p>An explainer introducing the Vector Lakebase &#8212; an architecture that unifies vector-database-grade serving with open lake storage and a shared semantic layer.</p><p><strong><a href="https://huggingface.co/blog/Dharma-AI/direct-preference-optimization-beyond-chatbots">Direct Preference Optimization Beyond Chatbots</a></strong></p><p>A blog post on applying Direct Preference Optimization to structured OCR &#8212; not for chat alignment &#8212; by using the SFT model&#8217;s own degeneration failures as rejection pairs.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2605.06737">A Self-Healing Framework for Reliable LLM-Based Autonomous Agents</a></strong></p><p><strong>Abstract:</strong></p><p>Autonomous agents based on Large Language Models (LLMs) are increasingly being utilized in complex software systems. 
However, reliability remains a significant challenge due to unpredictable failures such as hallucinations, execution errors, and inconsistent reasoning. This paper proposes a reliability-aware self-healing framework for LLM-based software agents. The framework integrates failure detection, reliability assessment, and automated recovery mechanisms. First, we define a taxonomy of failure types and introduce a quantitative reliability assessment model. Next, we propose a failure detection method that identifies abnormal agent behavior based on execution patterns and output consistency. Finally, we design a self-healing mechanism that dynamically recovers from failures through adaptive replanning and corrective prompting strategies. The proposed framework was implemented in a multi-agent workflow environment and evaluated using real-world task scenarios. Experimental results demonstrate that our approach significantly increases task success rates, reduces failure propagation, and enhances overall system robustness compared to existing methods. In particular, this study distinguishes itself by establishing an integrated monitoring system that combines the agent&#8217;s internal reasoning process with external execution results. These findings are expected to contribute to securing the stability of advanced autonomous systems and lowering the barriers to LLM adoption in production environments.</p><p><strong><a href="https://arxiv.org/abs/2505.17117">From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning</a></strong></p><p><strong>Abstract:</strong></p><p>Humans organize knowledge into compact conceptual categories that balance compression with semantic richness. Large Language Models (LLMs) exhibit impressive linguistic abilities, but whether they navigate this same compression-meaning trade-off remains unclear. 
We apply an Information Bottleneck framework to compare human conceptual structure with embeddings from 40+ LLMs using classic categorization benchmarks. We find that LLMs broadly align with human category boundaries, yet fall short on fine-grained semantic distinctions. Unlike humans, who maintain &#8220;inefficient&#8221; representations that preserve contextual nuance, LLMs aggressively compress, achieving more optimal information-theoretic compression at the cost of semantic richness. Surprisingly, encoder models outperform much larger decoder models in human alignment, suggesting that understanding and generation rely on distinct representational mechanisms. Training-dynamics analysis reveals a two-phase trajectory: rapid initial concept formation followed by architectural reorganization, during which semantic processing migrates from deep to mid-network layers as the model discovers increasingly efficient, sparser encodings. These divergent strategies, where LLMs optimize for compression and humans for adaptive utility, reveal fundamental differences between artificial and natural intelligence. This highlights the need for models that preserve the conceptual &#8220;inefficiencies&#8221; essential for human-like understanding.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 457]]></title><description><![CDATA[DeepSWE, The Best AI Observability Tools for Agentic Systems in 2026, a paper on SkillOpt: Executive Strategy for Self-Evolving Agent Skills, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-457</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-457</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Fri, 29 May 2026 15:02:20 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://deepswe.datacurve.ai/blog">DeepSWE</a>, <a href="https://www.comet.com/site/blog/ai-observability-tools/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-observability-tools/">The Best AI Observability Tools for Agentic Systems in 2026</a> and <a href="https://arxiv.org/abs/2605.23904">a paper on SkillOpt: Executive Strategy for Self-Evolving Agent Skills</a>.</p><p>You may also enjoy <a href="https://siliconangle.com/2026/05/19/google-reimagines-search-ai-agents-generative-interfaces/">Google reimagines search with AI agents and generative interfaces</a>, <a href="https://huggingface.co/blog/nvidia/nemotron-labs-diffusion#towards-speed-of-light-text-generation-with-nemotron-labs-diffusion-language-models">Towards 
Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models</a>, <a href="https://arxiv.org/abs/2604.17609">a paper on Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://deepswe.datacurve.ai/blog">DeepSWE</a></strong></p><p>Datacurve introduces DeepSWE, a contamination-free coding benchmark of 113 from-scratch tasks across 91 repos and 5 languages, where GPT-5.5 leads at 70% and frontier models separate far more sharply than on SWE-Bench Pro.</p><p><strong><a href="https://siliconangle.com/2026/05/19/google-reimagines-search-ai-agents-generative-interfaces/">Google reimagines search with AI agents and generative interfaces</a></strong></p><p>Google overhauls Search at I/O 2026 with always-on Search Agents that monitor the web and report back, plus generative UI that builds interactive mini-apps on the fly via Antigravity and Gemini 3.5 Flash.</p><p><strong><a href="https://siliconangle.com/2026/05/26/openrouter-raises-113m-bring-order-enterprise-ai-inference-routing/">OpenRouter raises $113M to bring order to enterprise AI inference routing</a></strong></p><p>OpenRouter raises $113M Series B led by CapitalG to scale its multi-model inference routing platform.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/ai-observability-tools/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-observability-tools/">The Best AI Observability Tools for Agentic Systems in 2026</a></strong></p><p>A guide to the top AI observability tools in 2026 for agentic systems. 
Learn which platforms are best for tracing, evaluation, debugging, testing, and monitoring in production.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://www.comet.com/site/blog/weavecli-opik-project-example/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=weavecli-opik-project-example/">What Held Up at 3 AM: One Engineer&#8217;s RAG Case Study</a></strong></p><p>Learn how WeaveCLI, a unified command-line tool for RAG over eleven vector databases, was built using Opik&#8217;s tracing capabilities and configurable retrieval pipelines.</p><p><strong><a href="https://openai.com/index/building-self-improving-tax-agents-with-codex/">Building self-improving tax agents with Codex</a></strong></p><p>OpenAI and Thrive Holdings build Tax AI, a Codex-driven self-improving agent that turns repeated accountant corrections into bounded evals.</p><p><strong><a href="https://huggingface.co/blog/nvidia/nemotron-labs-diffusion#towards-speed-of-light-text-generation-with-nemotron-labs-diffusion-language-models">Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models</a></strong></p><p>NVIDIA releases Nemotron-Labs Diffusion, an open model family (3B/8B/14B plus an 8B VLM) that combines autoregressive, diffusion, and self-speculation modes in one checkpoint.</p><p><strong><a href="https://medium.com/@MongoDB/the-agent-harness-why-the-llm-is-the-smallest-part-of-your-agent-system-bce68414ccfd">The Agent Harness: Why the LLM Is the Smallest Part of Your Agent System</a></strong></p><p>A technical article arguing that the LLM is the smallest part of a production agent system, with the real engineering living in a six-component harness and a deeper platform layer that determines production reliability.</p><p><strong><a href="https://medium.com/pinterest-engineering/an-engineers-guide-to-better-ai-skills-implementing-a-testing-process-to-optimize-agent-a000c9c9abcd">An Engineer&#8217;s Guide to Better AI Skills: 
Implementing a Testing Process to Optimize Agent Performance in Any Repository or Skill</a></strong></p><p>A practical guide from Pinterest Engineering on building a test harness to measure and improve how reliably coding agents invoke custom skills through frontmatter tuning and other techniques.</p><p><strong><a href="https://huggingface.co/blog/agent-glossary">Harness, Scaffold, and the AI Agent Terms Worth Getting Right</a></strong></p><p>A glossary that pins down the agent vocabulary people keep using loosely &#8212; clarifying the scaffold-versus-harness distinction and grounding terms like context engineering, policy, skills, and more.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/mksglu/context-mode">mksglu/context-mode</a></strong></p><p>Context window optimization for AI coding agents.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2605.23904">SkillOpt: Executive Strategy for Self-Evolving Agent Skills</a></strong></p><p><strong>Abstract:</strong></p><p>Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. 
SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.</p><p><strong><a href="https://arxiv.org/abs/2604.17609">Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity</a></strong></p><p><strong>Abstract:</strong></p><p>LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task&#8217;s solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact, or exploit, them in only 37-50% of cases. 
This gap is starkest in AppWorld: agents see documentation stating that a command &#8220;returns the complete solution to this task&#8221; in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. Our findings identify configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 456]]></title><description><![CDATA[Gemini 3.5: frontier intelligence with action, Codex-maxxing, a paper on Lance: Unified Multimodal Modeling by Multi-Task Synergy, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-456</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-456</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 21 May 2026 15:03:46 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/#gemini-3-5-flash">Gemini 3.5: frontier intelligence with action</a>, <a href="https://jxnl.co/writing/2026/05/10/codex-maxxing/">Codex-maxxing</a> and <a href="https://arxiv.org/abs/2605.18678">a paper on Lance: Unified Multimodal Modeling by Multi-Task Synergy</a>.</p><p>You may also enjoy <a href="https://cohere.com/blog/command-a-plus">Introducing Command A+</a>, <a href="https://alignment.anthropic.com/2026/sleight-bench/">SLEIGHT-Bench: Finding Blind Spots in AI Monitors</a>, <a href="https://arxiv.org/abs/2605.12882">a paper on CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/#gemini-3-5-flash">Gemini 3.5: frontier intelligence with action</a></strong></p><p>Google launched Gemini 3.5, leading with 3.5 Flash&#8212;a model delivering flagship-tier agentic and coding performance at under half the cost.</p><p><strong><a href="https://cohere.com/blog/command-a-plus">Introducing Command A+</a></strong></p><p>Cohere released Command A+ as open source: a 218B/25B-active MoE model for enterprise agentic workflows that runs on as little as two H100s or one Blackwell GPU, supports 48 languages, and adds multimodal reasoning.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/">Introducing Gemini Omni</a></strong></p><p>Google introduced Gemini Omni, a natively multimodal generation model debuting with Omni Flash&#8212;it creates and conversationally edits video from any combination of image, audio, video, and text inputs, rolling out across the Gemini app, Flow, and YouTube Shorts.</p><p><strong><a href="https://x.ai/news/grok-build-cli">Introducing Grok Build</a></strong></p><p>xAI launched Grok Build &#8212; a coding CLI powered by Grok 4.3 Heavy, featuring a 2M-token context window, 8 parallel subagents, and more.</p><p><strong><a href="https://bfl.ai/blog/outpainting-extend-any-image-in-any-direction">FLUX Outpainting: Extend any image, in any direction</a></strong></p><p>Black Forest Labs launched FLUX Outpainting, a purpose-built API endpoint that extends images in any direction without prompts.</p><p><strong><a href="https://cursor.com/blog/composer-2-5">Introducing Composer 2.5</a></strong></p><p>Cursor released Composer 2.5, a coding model (built on 
Moonshot&#8217;s Kimi K2.5) with gains on long-horizon agentic tasks.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/llm-cost-tracking-solution/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=llm-cost-tracking-solution/">LLM Cost Tracking Solution: How to Monitor and Control AI Spend in Agentic Systems</a></strong></p><p>A guide on treating LLM cost as an observability problem in agentic systems, using span/trace/project-level tracing to pinpoint token-burning prompts and routing.</p><p><strong><a href="https://redis.io/blog/context-is-all-you-need/">Context is all you need: Introducing Redis Iris</a></strong></p><p>Redis launched Redis Iris, a context engine sitting between agents and enterprise data&#8212;bundling five tools (two new: Context Retriever and Agent Memory) to deliver navigable, fresh, low-latency context with semantic caching that cuts token costs up to 90%.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://weaviate.io/blog/tokenization-text-analysis-weaviate">Text Analysis for Hybrid Search: Tokenization, Stopwords &amp; Accent Folding</a></strong></p><p>A technical guide on how Weaviate v1.37 makes BM25 tokenization observable and per-property configurable&#8212;covering accent folding, per-language stopwords, and more.</p><p><strong><a href="https://pytorch.org/blog/running-pytorch-models-on-apple-silicon-gpus-with-the-executorch-mlx-delegate/">Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate</a></strong></p><p>PyTorch released the experimental ExecuTorch MLX delegate, a backend that runs PyTorch models on Apple Silicon GPUs via Apple&#8217;s MLX framework.</p><p><strong><a href="https://jxnl.co/writing/2026/05/10/codex-maxxing/">Codex-maxxing</a></strong></p><p>A power user&#8217;s playbook for extracting more value from Codex&#8212;using durable threads, file-based memory, verifiable goals, and self-scheduling loops 
to turn it into a workspace where long-running knowledge work keeps progressing between sessions.</p><p><strong><a href="https://alignment.anthropic.com/2026/sleight-bench/">SLEIGHT-Bench: Finding Blind Spots in AI Monitors</a></strong></p><p>Anthropic researchers released SLEIGHT-Bench, a benchmark of 40 synthetic attacks across 11 categories that exploit &#8220;blind spots&#8221; in frontier AI monitors&#8212;on the Opus 4.6 monitor, 50% of attacks evaded all 10 trials and only 8 of 40 were reliably caught.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/generalaction/emdash">generalaction/emdash</a></strong></p><p>Emdash is the Open-Source Agentic Development Environment. Run multiple coding agents in parallel. Use any provider.</p><h2><strong>Papers &amp; Publications</strong></h2><p> <strong><a href="https://arxiv.org/abs/2605.18678">Lance: Unified Multimodal Modeling by Multi-Task Synergy</a></strong></p><p><strong>Abstract:</strong></p><p>We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. 
We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities.</p><p><strong><a href="https://arxiv.org/abs/2605.12882">CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence</a></strong></p><p><strong>Abstract:</strong></p><p>Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. 
Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 455]]></title><description><![CDATA[Interaction Models: A Scalable Approach to Human-AI Collaboration, Hidden Technical Debt of AI Systems: Agent Harness, a paper on Efficient Online Memory for Large Language Models, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-455</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-455</guid><dc:creator><![CDATA[Deep Learning Weekly]]></dc:creator><pubDate>Thu, 14 May 2026 15:02:11 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://thinkingmachines.ai/blog/interaction-models/">Interaction Models: A Scalable Approach to Human-AI Collaboration</a>, <a href="https://leehanchung.github.io/blogs/2026/05/08/hidden-technical-debt-agent-harness/">Hidden Technical Debt of AI Systems: Agent Harness</a> and <a href="https://arxiv.org/abs/2605.12357">a paper on &#948;-mem: Efficient Online Memory for Large Language Models</a>.</p><p>You may also enjoy <a href="https://www.perceptron.inc/blog/introducing-perceptron-mk1">Introducing Perceptron Mk1</a>, <a href="https://www.anthropic.com/research/teaching-claude-why">Teaching Claude why</a>, <a href="https://arxiv.org/abs/2605.03546">a paper on ProgramBench: Can Language Models Rebuild Programs From Scratch?</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.perceptron.inc/blog/introducing-perceptron-mk1">Introducing Perceptron Mk1</a></strong></p><p>Perceptron AI launches Mk1, a video and embodied-reasoning vision-language model priced roughly 80&#8211;90% cheaper than Claude Sonnet 4.5, GPT-5, and Gemini 3.1 Pro.</p><p><strong><a href="https://techcrunch.com/2026/05/13/notion-just-turned-its-workspace-into-a-hub-for-ai-agents/">Notion just turned its workspace into a hub for AI agents</a></strong></p><p>Notion launches Developer Platform turning its workspace into an agent orchestration hub with custom code Workers, external database sync, and native integrations for Claude Code, Cursor, Codex, and Decagon.</p><p><strong><a href="https://thinkingmachines.ai/blog/interaction-models/">Interaction Models: A Scalable Approach to Human-AI Collaboration</a></strong></p><p>Thinking Machines unveils 
TML-Interaction-Small, a 276B MoE (12B active) interaction model trained from scratch with 200ms time-aligned micro-turns that natively handles concurrent audio, video, and text without VAD-style harnesses.</p><p><strong><a href="https://unsloth.ai/blog/pytorch">Unsloth Joins the PyTorch Ecosystem</a></strong></p><p>Unsloth joins the PyTorch Ecosystem Landscape, recognizing its open-source contributions including 2&#215; faster training with 70% less VRAM, FP8 RL for consumer GPUs, and 250M+ model downloads.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://leehanchung.github.io/blogs/2026/05/08/hidden-technical-debt-agent-harness/">Hidden Technical Debt of AI Systems: Agent Harness</a></strong></p><p>A Hanchung Lee essay reframing Sculley&#8217;s 2015 ML technical debt diagram for the agent era, arguing the agent runtime (harness + state) &#8212; not the model &#8212; is where most spend, incidents, and architectural debt are now accumulating.</p><p><strong><a href="https://huggingface.co/blog/amazon/foundation-model-building-blocks">Building Blocks for Foundation Model Training and Inference on AWS</a></strong></p><p>A reference guide from Amazon mapping AWS&#8217;s four-layer infrastructure stack to foundation model pre-training, post-training, and inference workloads.</p><p><strong><a href="https://developer.nvidia.com/blog/how-to-eliminate-pipeline-friction-in-ai-model-serving/">How to Eliminate Pipeline Friction in AI Model Serving</a></strong></p><p>A practical NVIDIA guide laying out 18 best practices to eliminate AI model-serving friction across export issues, unsupported ops, dynamic input shapes, and version mismatches.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://www.anthropic.com/research/teaching-claude-why">Teaching Claude why</a></strong></p><p>An Anthropic post detailing how teaching Claude why actions are aligned &#8212; via constitutional documents and ethical reasoning, not just demonstrations 
&#8212; drove blackmail rates from 96% (Opus 4) to 0% on every Claude model since Haiku 4.5.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/">Accelerating Gemma 4: faster inference with multi-token prediction drafters</a></strong></p><p>Google releases Multi-Token Prediction (MTP) drafters for Gemma 4 models, delivering up to 3x faster inference via speculative decoding with zero quality degradation.</p><p><strong><a href="https://simonw.substack.com/p/vibe-coding-and-agentic-engineering">Vibe coding and agentic engineering are getting closer than I&#8217;d like</a></strong></p><p>A post by Simon Willison observing that vibe coding and agentic engineering are converging in his own workflow as he increasingly ships production code from Claude Code without reviewing every line.</p><p><strong><a href="https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber-capability-advancing">How fast is autonomous AI cyber capability advancing?</a></strong></p><p>UK AISI reports the length of cyber tasks frontier models can autonomously complete is doubling every 4.7 months &#8212; accelerating from 8 months last November &#8212; with Claude Mythos Preview and GPT-5.5 exceeding even that trend.</p><p><strong><a href="https://deepmind.google/blog/ai-pointer/">Reimagining the mouse pointer for the AI era</a></strong></p><p>A design-principles post from Google DeepMind reframing the mouse pointer as a Gemini-powered context-aware partner, built on four principles: maintain the flow, show and tell, embrace &#8220;this/that&#8221; deixis, and turn pixels into actionable entities.</p><p><strong><a href="https://www.pinecone.io/blog/full-text-search-architecture/">Full Text Search: Architecture and Design</a></strong></p><p>A technical architecture post from Pinecone introducing full-text search built on Tantivy, delivering Lucene query syntax, BM25 scoring, 18-language tokenization, and 22.7ms p50 latency on 
6.4M Wikipedia articles.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2605.12357">&#948;-mem: Efficient Online Memory for Large Language Models</a></strong></p><p><strong>Abstract:</strong></p><p>Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose &#948;-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. &#948;-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone&#8217;s attention computation during generation. With only an 8&#215;8 online memory state, &#948;-mem improves the average score to 1.10&#215; that of the frozen backbone and 1.15&#215; that of the strongest non-&#948;-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching 1.31&#215; on MemoryAgentBench and 1.20&#215; on LoCoMo, while largely preserving general capabilities. 
These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.</p><p><strong><a href="https://arxiv.org/abs/2605.03546">ProgramBench: Can Language Models Rebuild Programs From Scratch?</a></strong></p><p><strong>Abstract:</strong></p><p>Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable&#8217;s behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. 
Models favor monolithic, single-file implementations that diverge sharply from human-written code.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 454]]></title><description><![CDATA[MolmoAct 2: An open foundation for robots, How to Work and Compound with AI, a paper on ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-454</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-454</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 07 May 2026 15:03:11 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://allenai.org/blog/molmoact2">MolmoAct 2: An open foundation for robots</a>, <a href="https://eugeneyan.com/writing/working-with-ai/">How to Work and Compound with AI</a> and <a href="https://arxiv.org/abs/2605.03042">a paper on ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration</a>.</p><p>You may also enjoy <a href="https://openai.com/index/gpt-5-5-instant/">GPT-5.5 Instant</a>, <a href="https://epoch.ai/blog/chips-topic-overview">AI Chips: why they cost as much as a car, and why companies can&#8217;t get enough</a>, <a href="https://arxiv.org/abs/2605.04036">a paper on OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://allenai.org/blog/molmoact2">MolmoAct 2: An open foundation for robots that work in the real world</a></strong></p><p>Ai2 releases MolmoAct 2, a fully open robotics foundation model running up to 37x faster than its predecessor, alongside the largest open-source bimanual manipulation dataset.</p><p><strong><a href="https://openai.com/index/gpt-5-5-instant/">GPT-5.5 Instant: smarter, clearer, and more personalized</a></strong></p><p>OpenAI releases GPT-5.5 Instant as ChatGPT&#8217;s new default model, cutting hallucinations by 52.5% on high-stakes prompts and using 30% fewer words while pulling context from past chats, files, and more.</p><p><strong><a href="https://x.ai/news/anthropic-compute-partnership">New Compute Partnership with Anthropic</a></strong></p><p>SpaceXAI signs a deal with rival Anthropic for full access to Colossus 1, with Anthropic also expressing interest in jointly developing multi-gigawatt orbital compute.</p><p><strong><a href="https://siliconangle.com/2026/05/05/blitzy-raises-200m-1-4b-valuation-deploy-thousands-coding-agents-parallel/">Blitzy raises $200M at $1.4B valuation to deploy thousands of coding agents in parallel</a></strong></p><p>Blitzy raises $200M at $1.4B valuation to scale its enterprise platform that orchestrates thousands of parallel coding agents across 100M+ line legacy codebases, scoring 66.5% on SWE-Bench Pro.</p><p><strong><a href="https://siliconangle.com/2026/05/06/monday-com-relaunches-ai-work-platform-native-agents/">Monday.com relaunches as an AI work platform with native agents</a></strong></p><p>Monday.com relaunches as an &#8220;AI work platform&#8221; with native agents that draft campaigns, qualify leads, and triage tickets across its 250,000+ customers, plus 
one-click connectors to Claude, ChatGPT, Copilot, and Gemini.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/end-to-end-agent-testing/">Introducing the Opik Agent Playground</a></strong></p><p>Comet launches Opik Agent Playground, a UI-based environment for testing and tweaking full agent configurations (prompts, models, tools) without touching code, opening iteration to PMs and domain experts.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://epoch.ai/blog/chips-topic-overview">AI Chips: why they cost as much as a car, and why companies can&#8217;t get enough</a></strong></p><p>A primer on AI chip economics and supply: the entire frontier flows through TSMC and a handful of designers, with total compute capacity doubling every 7 months while performance-per-dollar doubles every 2.5 years.</p><p><strong><a href="https://www.philschmid.de/subagent-patterns-2026">How Agents Manage Other Agents: Four Subagents Patterns in 2026</a></strong></p><p>A practical blog post about four subagent orchestration patterns&#8212;inline tool, fan-out, agent pool, and teams&#8212;each requiring progressively more capable models and offering different tradeoffs in control, lifetime, and result collection.</p><p><strong><a href="https://leehanchung.github.io/blogs/2026/05/01/dont-outsource-your-understanding/">Don&#8217;t Outsource Your Understanding</a></strong></p><p>An essay arguing the real AI risk isn&#8217;t using it but &#8220;cognitive surrender&#8221;&#8212;outsourcing verification too&#8212;evidenced by 1,300+ hallucinated court filings and 50 ICLR papers with fake citations.</p><p><strong><a href="https://huggingface.co/blog/ibm-granite/granite-4-1">Granite 4.1 LLMs: How They&#8217;re Built</a></strong></p><p>A technical deep-dive on how Granite 4.1 was built: 15T-token, five-phase pretraining with long-context extension to 512K, SFT on 4.1M curated samples, and on-policy GRPO with DAPO 
loss.</p><p><strong><a href="https://eugeneyan.com/writing/working-with-ai/">How to Work and Compound with AI</a></strong></p><p>A practitioner&#8217;s playbook for compounding with AI: treat context as infra, encode taste as config (CLAUDE.md, skills), make verification cheap, delegate bigger chunks in parallel, and mine transcripts to close the loop.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/zilliztech/claude-context">zilliztech/claude-context</a></strong></p><p>Claude Context is an MCP plugin that adds semantic code search to Claude Code and other AI coding agents, giving them deep context from your entire codebase.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2605.03042">ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration</a></strong></p><p><strong>Abstract:</strong></p><p>This report describes ARIS (Auto-Research-in-sleep), an open-source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. For long-horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long-running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor&#8217;s framing. 
Therefore, we present ARIS as a research harness that coordinates machine-learning research workflows through cross-model adversarial collaboration as a default configuration: an executor model drives forward progress while a reviewer from a different model family is recommended to critique intermediate artifacts and request revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence: integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence, as well as a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of the rendered PDF. A prototype self-improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.</p><p><strong><a href="https://arxiv.org/abs/2605.04036">OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories</a></strong></p><p><strong>Abstract:</strong></p><p>Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). 
In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity&#8217;s Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 453]]></title><description><![CDATA[OpenAI models come to AWS, Hidden Technical Debt of AI Systems: Agent Runtime, a paper on Recursive Multi-Agent Systems, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-453</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-453</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 30 Apr 2026 15:02:50 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://openai.com/index/openai-on-aws/">OpenAI models, Codex, and Managed Agents come to AWS</a>, <a href="https://leehanchung.github.io/blogs/2026/04/24/hidden-technical-debt-agent-runtime/">Hidden Technical Debt of AI Systems: Agent Runtime</a> and <a href="https://arxiv.org/abs/2604.25917">a paper on Recursive Multi-Agent Systems</a>.</p><p>You may also enjoy <a href="https://venturebeat.com/technology/mistral-ai-launches-workflows-a-temporal-powered-orchestration-engine-already-running-millions-of-daily-executions">Mistral AI&#8217;s Workflows</a>, <a href="https://research.google/blog/four-ways-google-research-scientists-have-been-using-empirical-research-assistance/">Four ways Google Research scientists have been using Empirical Research Assistance</a>, <a 
href="https://arxiv.org/abs/2604.22446">a paper on From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://openai.com/index/openai-on-aws/">OpenAI models, Codex, and Managed Agents come to AWS</a></strong></p><p>OpenAI brings GPT-5.5, Codex, and Managed Agents to Amazon Bedrock in limited preview, a day after ending Microsoft&#8217;s exclusive cloud license.</p><p><strong><a href="https://venturebeat.com/technology/mistral-ai-launches-workflows-a-temporal-powered-orchestration-engine-already-running-millions-of-daily-executions">Mistral AI launches Workflows, a Temporal-powered orchestration engine already running millions of daily executions</a></strong></p><p>Mistral AI launches Workflows in public preview, a Temporal-powered durable execution engine that lets enterprises run human-in-the-loop AI processes with data staying inside their own infrastructure.</p><p><strong><a href="https://siliconangle.com/2026/04/27/ineffable-intelligence-raises-1-1b-5-1b-valuation-build-ai-superlearner/">Ineffable Intelligence raises $1.1B at $5.1B valuation to build an AI &#8216;superlearner&#8217;</a></strong></p><p>AlphaGo creator David Silver&#8217;s new British startup Ineffable Intelligence raises $1.1B to build a &#8220;superlearner&#8221; AI that generates entirely new knowledge via RL without pretraining.</p><p><strong><a href="https://venturebeat.com/technology/american-ai-startup-poolside-launches-free-high-performing-open-model-laguna-xs-2-for-local-agentic-coding">American AI startup Poolside launches free, high-performing open model Laguna XS.2 for local agentic coding</a></strong></p><p>Poolside releases two agentic coding models &#8212; the 
open-weight Laguna XS.2 and proprietary Laguna M.1 &#8212; both trained from scratch and free to use temporarily via API, alongside a terminal coding agent and web IDE.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://leehanchung.github.io/blogs/2026/04/24/hidden-technical-debt-agent-runtime/">Hidden Technical Debt of AI Systems: Agent Runtime</a></strong></p><p>A technical blog post arguing that the agent runtime &#8212; the sandboxed execution environment wrapping the model &#8212; is the emerging hidden technical debt of AI systems.</p><p><strong><a href="https://venturebeat.com/infrastructure/context-decay-orchestration-drift-and-the-rise-of-silent-failures-in-ai-systems">Context decay, orchestration drift, and the rise of silent failures in AI systems</a></strong></p><p>A practical guide about four enterprise AI failure patterns &#8212; context degradation, orchestration drift, silent partial failure, and automation blast radius &#8212; that standard infrastructure monitoring cannot detect, and what teams must add to catch them.</p><p><strong><a href="https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns">Monitoring LLM behavior: Drift, retries, and refusal patterns</a></strong></p><p>A practical guide on instrumenting LLM applications with two complementary evaluation pipelines &#8212; offline regression testing and online behavioral telemetry &#8212; to detect model drift before and after deployment.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/four-ways-google-research-scientists-have-been-using-empirical-research-assistance/">Four ways Google Research scientists have been using Empirical Research Assistance</a></strong></p><p>A Google Research blog post on four real-world applications of their Empirical Research Assistance (ERA) tool &#8212; an LLM-backed system for generating expert-level scientific software &#8212; spanning epidemiology, 
cosmology, climate monitoring, and neuroscience.</p><p><strong><a href="https://alignment.anthropic.com/2026/ai-organizations/">AI Organizations Can Be More Effective but Less Aligned than Individual Agents</a></strong></p><p>An Anthropic research paper found that teams of individually aligned AI agents can still collectively produce less ethical &#8212; but more effective &#8212; solutions than a single agent, suggesting AI safety research needs to move beyond studying agents in isolation.</p><p><strong><a href="https://blog.ml.cmu.edu/2026/04/27/arfbench/">Introducing ARFBench: A time series question-answering benchmark based on real incidents</a></strong></p><p>CMU and Datadog introduce ARFBench, a 750-question time series QA benchmark derived from real production incidents, where the best model (GPT-5 at 62.7% accuracy) still trails domain experts by ~9 points but a hybrid TSFM-VLM oracle reaches 87.2%.</p><p><strong><a href="https://deepmind.google/blog/decoupled-diloco/">Decoupled DiLoCo: Resilient, Distributed AI Training at Scale</a></strong></p><p>Google DeepMind releases Decoupled DiLoCo, a fault-tolerant distributed training architecture that achieves 88% goodput vs. 
27% for standard data-parallel at scale, using ~240x less inter-datacenter bandwidth with no measurable ML performance loss.</p><p><strong><a href="https://www.philschmid.de/use-mcp-servers">How to correctly use MCP servers with your AI Agents</a></strong></p><p>A practical guide on avoiding context bloat from MCP servers by using two patterns &#8212; user-triggered @mention injection for ad-hoc tool loading, and scoped subagent declarations.</p><p><strong><a href="https://deepseek.ai/blog/deepseek-v4-compressed-attention">DeepSeek V4 Compressed Attention: How the KV-Cache Shrinks to Just 2%</a></strong></p><p>A technical explainer on how DeepSeek V4 achieves 1M-token context windows by compressing the KV cache to just 2% of standard size &#8212; combining coarse and fine-grained sequence-dimension compression across a hybrid layer stack.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/facebookresearch/sapiens2">facebookresearch/sapiens2</a></strong></p><p>1K resolution vision transformers pretrained on 1B human images.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2604.25917">Recursive Multi-Agent Systems</a></strong></p><p><strong>Abstract:</strong></p><p>Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend such scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? 
To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thoughts generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2&#215;-2.4&#215; end-to-end inference speedup, and 34.6%-75.6% token usage reduction.</p><p><strong><a href="https://arxiv.org/abs/2604.22446">From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company</a></strong></p><p><strong>Abstract:</strong></p><p>Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce \emph{OneManCompany (OMC)}, a framework that elevates multi-agent systems to the organisational level. 
OMC encapsulates skills, tools, and runtime configurations into portable agent identities called \emph{Talents}, orchestrated through typed organisational interfaces that abstract over heterogeneous backends. A community-driven \emph{Talent Market} enables on-demand recruitment, allowing the organisation to close capability gaps and reconfigure itself dynamically during execution. Organisational decision-making is operationalised through an \emph{Explore-Execute-Review} (E2R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop: tasks are decomposed top-down into accountable units and execution outcomes are aggregated bottom-up to drive systematic review and refinement. This loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi-agent systems from static, pre-configured pipelines into self-organising and self-improving AI organisations capable of adapting to open-ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an 84.67% success rate, surpassing the state of the art by 15.48 percentage points, with cross-domain case studies further demonstrating its generality.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 452]]></title><description><![CDATA[Introducing Ollie: Auto-Fix Your Agent&#8217;s Codebase, Designing synthetic datasets for the real world: Mechanism design and reasoning from first principles, a paper on Adam's Law: Textual Frequency Law o]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-452</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-452</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 23 Apr 2026 15:01:16 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/self-improving-agents/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=self-improving-agents/">Introducing Ollie: Auto-Fix Your Agent&#8217;s Codebase</a>, <a href="https://research.google/blog/designing-synthetic-datasets-for-the-real-world-mechanism-design-and-reasoning-from-first-principles/">Designing synthetic datasets for the real world: Mechanism design and reasoning from first principles</a> and <a href="https://arxiv.org/abs/2604.02176">a paper on Adam&#8217;s Law: Textual Frequency Law on Large Language Models</a>.</p><p>You may also enjoy <a href="https://www.anthropic.com/news/claude-opus-4-7">Claude Opus 4.7</a>, <a 
href="https://zilliz.com/blog/notion-vector-search-next-problem">Notion Vector Search Architecture</a>, <a href="https://openreview.net/forum?id=7xjoTuaNmN">OpenThoughts: Data Recipes for Reasoning Models</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://openai.com/index/introducing-chatgpt-images-2-0/">Introducing ChatGPT Images 2.0</a></strong></p><p>OpenAI releases ChatGPT Images 2.0, its first image model with native reasoning and web search, generating up to 8 coherent images per prompt at up to 2K resolution.</p><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-7">Introducing Claude Opus 4.7 \ Anthropic</a></strong></p><p>Anthropic releases Claude Opus 4.7, a coding-focused upgrade over Opus 4.6 with significantly improved vision, a new xhigh effort level, and real-world cyber safeguards.</p><p><strong><a href="https://openai.com/index/introducing-openai-privacy-filter/">Introducing OpenAI Privacy Filter</a></strong></p><p>OpenAI releases Privacy Filter, a 1.5B-parameter open-source, on-device PII detection and redaction model derived from gpt-oss, scoring 96% F1 on PII-Masking-300k.</p><p><strong><a href="https://www.kimi.com/blog/kimi-k2-6">Kimi K2.6 Tech Blog: Advancing Open-Source Coding</a></strong></p><p>Moonshot AI open-sources Kimi K2.6, a coding and long-horizon agent model that scales agent swarms to 300 concurrent sub-agents across 4,000 coordinated steps, with benchmark results competitive with GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro and agentic tasks.</p><p><strong><a href="https://venturebeat.com/technology/googles-new-deep-research-and-deep-research-max-agents-can-search-the-web-and-your-private-data">Google&#8217;s new Deep Research and Deep Research Max agents can 
search the web and your private data</a></strong></p><p>Google launches two Gemini 3.1 Pro-powered autonomous research agents &#8212; Deep Research and Deep Research Max &#8212; that combine open web search with proprietary enterprise data via MCP in a single API call.</p><h2><strong>MLOps/LLMOps/AgentOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/self-improving-agents/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=self-improving-agents/">Introducing Ollie: Auto-Fix Your Agent&#8217;s Codebase</a></strong></p><p>Comet announces Ollie, a coding assistant embedded in the Opik platform that closes the observability-to-action loop by autonomously analyzing agent traces, diagnosing failures, patching code, and writing regression tests &#8212; all within a single workflow.</p><p><strong><a href="https://www.comet.com/site/blog/ai-agent-regression-testing/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-agent-regression-testing/">Introducing Opik Test Suites: Straightforward Unit &amp; Regression Testing for AI Agents</a></strong></p><p>Comet announces Opik Test Suites, a regression testing framework for AI agents that replaces dataset-based evaluation scores with software-style pass/fail assertions written in plain English.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/designing-synthetic-datasets-for-the-real-world-mechanism-design-and-reasoning-from-first-principles/">Designing synthetic datasets for the real world: Mechanism design and reasoning from first principles</a></strong></p><p>A Google Research blog post introducing Simula, a reasoning-first synthetic data framework that treats dataset generation as mechanism design &#8212; controlling diversity, complexity, and quality as independent axes.</p><p><strong><a href="https://opensearch.org/blog/benchmarking-multimodal-document-search-in-opensearch-three-approaches-compared/">Benchmarking 
multimodal document search in OpenSearch: Three approaches compared</a></strong></p><p>A technical benchmark comparing ColPali late-interaction reranking, BDA modality-aware embedding, and text-only chunking for multimodal document search in OpenSearch across quality, latency, and ingest performance on 1,000 report pages.</p><p><strong><a href="https://zilliz.com/blog/notion-vector-search-next-problem">Notion Vector Search Architecture: What Comes Next</a></strong></p><p>A blog post analyzing Notion&#8217;s two-year vector search evolution as a proxy for the harder infrastructure problems &#8212; offline context engineering, embedding model upgrades, and real-time/batch unification &#8212; that scaling multiple AI features will demand next.</p><p><strong><a href="https://weaviate.io/blog/engram-deep-dive">Engram: Memory by Weaviate</a></strong></p><p>Weaviate announces Engram, a managed memory service that uses async pipelines to extract, deduplicate, and maintain agent memories on top of Weaviate&#8217;s vector database.</p><p><strong><a href="https://alignment.anthropic.com/2026/automated-w2s-researcher/">Automated Weak-to-Strong Researcher</a></strong></p><p>Anthropic&#8217;s Claude-powered Automated Alignment Researcher achieves a 0.97 performance gap recovered score on weak-to-strong supervision in 5 days &#8212; versus 0.23 by human researchers in 7 days.</p><p><strong><a href="https://embracethered.com/blog/posts/2026/breaking-opus-4.7-with-chatgpt/">Breaking Opus 4.7 with ChatGPT (Hacking Claude&#8217;s Memory)</a></strong></p><p>A security research post demonstrating a ChatGPT-generated adversarial image that successfully hijacked Claude Opus 4.7&#8217;s memory tool via indirect prompt injection &#8212; succeeding 5 out of 10 attempts before Anthropic patched the specific exploit within 24 hours.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source AI 
observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/google/skills/tree/main">google/skills</a></strong></p><p>Agent Skills for Google products and technologies</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2604.02176">Adam&#8217;s Law: Textual Frequency Law on Large Language Models</a></strong></p><p><strong>Abstract:</strong></p><p>While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. 
Results show the effectiveness of our framework.</p><p><strong><a href="https://openreview.net/forum?id=7xjoTuaNmN">OpenThoughts: Data Recipes for Reasoning Models</a></strong></p><p><strong>Abstract:</strong></p><p>Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond &#8211; improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 451]]></title><description><![CDATA[The 2026 AI Index Report, MirrorCode: Evidence that AI can already do some weeks-long coding tasks, a paper on Introspective Diffusion Language Models, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-451</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-451</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 16 Apr 2026 15:03:11 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://hai.stanford.edu/ai-index/2026-ai-index-report">The 2026 AI Index Report</a>, <a href="https://epoch.ai/blog/mirrorcode-preliminary-results">MirrorCode: Evidence that AI can already do some weeks-long coding tasks</a> and <a href="https://arxiv.org/abs/2604.11035">a paper on Introspective Diffusion Language Models</a>.</p><p>You may also enjoy <a href="https://deepmind.google/blog/gemini-robotics-er-1-6/">Gemini Robotics ER 1.6: Enhanced Embodied Reasoning</a>, <a href="https://blog.ml.cmu.edu/2026/04/13/when-should-ai-step-aside-teaching-agents-when-humans-want-to-intervene/">Should AI Step Aside?: Teaching Agents When Humans Want to Intervene</a>, <a href="https://arxiv.org/abs/2604.11641">a paper on CodeTracer: Towards Traceable Agent 
States</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://hai.stanford.edu/ai-index/2026-ai-index-report">The 2026 AI Index Report | Stanford HAI</a></strong></p><p>Stanford HAI releases the 2026 AI Index &#8212; a 400+ page annual report tracking AI&#8217;s technical performance, investment, labor market effects, policy landscape, and public sentiment across nine chapters.</p><p><strong><a href="https://deepmind.google/blog/gemini-robotics-er-1-6/">Gemini Robotics ER 1.6: Enhanced Embodied Reasoning</a></strong></p><p>Google DeepMind releases Gemini Robotics-ER 1.6, a robotics-specialized reasoning model with upgraded spatial reasoning, multi-view success detection, and instrument reading (93% accuracy with agentic vision).</p><p><strong><a href="https://ai.meta.com/blog/introducing-muse-spark-msl/">Introducing Muse Spark: Scaling Towards Personal Superintelligence</a></strong></p><p>Meta&#8217;s Superintelligence Labs launches Muse Spark &#8212; a natively multimodal reasoning model with multi-agent &#8220;Contemplating&#8221; mode that achieves 58% on Humanity&#8217;s Last Exam.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/">Gemini 3.1 Flash TTS: the next generation of expressive AI speech</a></strong></p><p>Google launches Gemini 3.1 Flash TTS, a text-to-speech model with natural-language audio tags for granular vocal control across 70+ languages, available via Gemini API, Vertex AI, and Google Vids.</p><p><strong><a href="https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/introducing-mai-image-2-efficient-faster-more-efficient-image-generation/4510918">Introducing MAI-Image-2-Efficient: Faster, More Efficient Image 
Generation</a></strong></p><p>Microsoft releases MAI-Image-2-Efficient &#8212; 22% faster and 4x more GPU-efficient than MAI-Image-2, targeting high-volume and real-time image generation workloads.</p><p><strong><a href="https://claude.com/blog/introducing-routines-in-claude-code">Introducing routines in Claude Code</a></strong></p><p>Anthropic launches Routines in Claude Code &#8212; serverless automations triggered by schedule, API call, or GitHub webhook events, with daily limits of 5&#8211;25 runs depending on plan tier.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/multimodal-llm-evaluation/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=multimodal-llm-evaluation//">Multimodal LLM Evaluation: A Developer&#8217;s Guide to Multimodal Language Models</a></strong></p><p>A guide to evaluating multimodal LLMs, highlighting why text-only metrics fall short for image, audio, and video inputs, while outlining methods for grounding outputs and using LLM-based evaluation to measure real-world performance.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://blog.ml.cmu.edu/2026/04/13/when-should-ai-step-aside-teaching-agents-when-humans-want-to-intervene/">Should AI Step Aside?: Teaching Agents When Humans Want to Intervene</a></strong></p><p>A research blog post introducing CowCorpus and PlowPilot &#8212; a dataset and intervention-aware web agent system that predicts when users want to take over, yielding a 26.5% improvement in user-rated usefulness over a fully autonomous baseline.</p><p><strong><a href="https://epoch.ai/blog/mirrorcode-preliminary-results">MirrorCode: Evidence that AI can already do some weeks-long coding tasks</a></strong></p><p>A research report from Epoch AI introducing MirrorCode, a long-horizon coding benchmark, showing Claude Opus 4.6 can autonomously reimplement a 16,000-line bioinformatics toolkit estimated to take a human engineer 2&#8211;17 
weeks.</p><p><strong><a href="https://www.philschmid.de/agent-skills-tips">8 Tips for Writing Agent Skills</a></strong></p><p>A practical guide on authoring effective agent skills, covering description precision, instruction conciseness, layered context loading, and when to retire skills as model capabilities advance.</p><p><strong><a href="https://www.quantamagazine.org/the-ai-revolution-in-math-has-arrived-20260413/">The AI Revolution in Math Has Arrived</a></strong></p><p>A Quanta Magazine feature documenting how AI has become a genuine research accelerator, with mathematicians using it to discover and prove new results in days rather than months.</p><p><strong><a href="https://unsloth.ai/docs/models/gemma-4/train">Gemma 4 Fine-tuning Guide</a></strong></p><p>Unsloth&#8217;s technical guide for fine-tuning Google&#8217;s Gemma 4 family covering VRAM requirements, critical bug fixes for KV-sharing and gradient accumulation, and recipes for SFT, vision, audio, and GRPO training.</p><p><strong><a href="https://research.google/blog/towards-developing-future-ready-skills-with-generative-ai/">Towards developing future-ready skills with generative AI</a></strong></p><p>A Google blog post introducing Vantage, a GenAI-powered assessment platform that places students in AI-simulated multi-party conversations to measure &#8220;future-ready&#8221; skills.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f">LLM Wiki</a></strong></p><p>A pattern for building personal knowledge bases using LLMs.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a 
href="https://arxiv.org/abs/2604.11035">Introspective Diffusion Language Models</a></strong></p><p><strong>Abstract:</strong></p><p>Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.</p><p><strong><a href="https://arxiv.org/abs/2604.11641">CodeTracer: Towards Traceable Agent States</a></strong></p><p><strong>Abstract:</strong></p><p>Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. 
As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent&#8217;s state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 450]]></title><description><![CDATA[Gemma 4, Components of A Coding Agent, a paper on VOID: Video Object and Interaction Deletion, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-450</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-450</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 09 Apr 2026 15:01:40 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Gemma 4</a>, <a href="https://magazine.sebastianraschka.com/p/components-of-a-coding-agent">Components of A Coding Agent</a>, and <a href="https://arxiv.org/abs/2604.02296">a paper on VOID: Video Object and Interaction Deletion</a>.</p><p>You may also enjoy <a href="https://claude.com/blog/claude-managed-agents">Claude Managed Agents</a>, <a href="https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/">Evaluating alignment of behavioral dispositions in LLMs</a>, <a href="https://arxiv.org/abs/2604.04921">a paper on TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://qwen.ai/blog?id=qwen3.6">Qwen: Qwen3.6-Plus: Towards Real World Agents</a></strong></p><p>Alibaba launches Qwen3.6-Plus, a frontier agentic coding model that matches or beats Claude Opus 4.5 on SWE-bench and Terminal-Bench 2.0.</p><p><strong><a href="https://claude.com/blog/claude-managed-agents">Claude Managed Agents: get to production 10x faster</a></strong></p><p>Anthropic launches Claude Managed Agents in public beta &#8212; a suite of composable, cloud-hosted agent APIs that abstract away sandboxing, state management, permissioning, and orchestration, enabling teams to ship production agents in days instead of months.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Gemma 4: Our most capable open models to date</a></strong></p><p>Google releases Gemma 4, a family of four open models &#8212; with the 31B ranking #3 among open models on Arena AI and outcompeting models 20x its size.</p><p><strong><a href="https://siliconangle.com/2026/04/07/modus-secures-85m-expand-ai-powered-audit-accounting-partnerships/">Modus secures $85M to expand AI-powered audit and accounting partnerships</a></strong></p><p>Modus Audit raises $85M to deploy AI across audit and accounting firm workflows.</p><p><strong><a href="https://ollama.com/blog/mlx">Ollama is now powered by MLX on Apple Silicon in preview</a></strong></p><p>Ollama 0.19 launches MLX-powered inference on Apple Silicon, delivering ~2x gains in prefill and decode speed on M5 chips, with NVFP4 quantization support and smarter KV cache reuse for agentic workloads.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a 
href="https://www.comet.com/site/blog/ai-agent-evaluation/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=ai-agent-evaluation//">AI Agent Evaluation: Building Reliable Systems Beyond Simple Testing</a></strong></p><p>A practical guide on why standard LLM evaluation breaks for agentic systems, covering compounding failures, process vs. outcome metrics, multi-turn state tracking, and the trace-evaluate-optimize loop needed for production agents.</p><p><strong><a href="https://aws.amazon.com/blogs/machine-learning/simulate-realistic-users-to-evaluate-multi-turn-ai-agents-in-strands-evals/">Simulate realistic users to evaluate multi-turn AI agents in Strands Evals</a></strong></p><p>A technical blog about ActorSimulator in AWS&#8217;s Strands Evals SDK, which generates persona-consistent, goal-driven simulated users to automate multi-turn agent evaluation at scale.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://magazine.sebastianraschka.com/p/components-of-a-coding-agent">Components of A Coding Agent</a></strong></p><p>A breakdown by Sebastian Raschka of the six architectural components that make coding agents (Claude Code, Codex CLI) meaningfully more capable than raw LLMs in a chat UI.</p><p><strong><a href="https://ngrok.com/blog/quantization">Quantization from the ground up</a></strong></p><p>A highly interactive, ground-up explainer on LLM quantization covering floating point formats, symmetric vs. 
asymmetric compression, outlier handling, and empirical quality/speed tradeoffs.</p><p><strong><a href="https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/">Evaluating alignment of behavioral dispositions in LLMs</a></strong></p><p>A blog post on evaluating behavioral alignment across 25 LLMs, finding frontier models hit ~80&#8211;83% alignment with human consensus but are systematically overconfident in ambiguous scenarios and inconsistent between self-reported and revealed behavior.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/NousResearch/hermes-agent">NousResearch/hermes-agent</a></strong></p><p>The self-improving AI agent built by Nous Research. It&#8217;s the only agent with a built-in learning loop.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2604.02296">VOID: Video Object and Interaction Deletion</a></strong></p><p><strong>Abstract:</strong></p><p>Existing video object removal methods excel at inpainting content &#8220;behind&#8221; the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. 
During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.</p><p><strong><a href="https://arxiv.org/abs/2604.04921">TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</a></strong></p><p><strong>Abstract:</strong></p><p>Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. 
On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 449]]></title><description><![CDATA[Gemini 3.1 Flash Live, Cohere Transcribe: state-of-the-art speech recognition, a paper on IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-449</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-449</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 02 Apr 2026 15:03:12 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a 
href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-live/">Gemini 3.1 Flash Live</a>, <a href="https://cohere.com/blog/transcribe">Cohere Transcribe: state-of-the-art speech recognition</a>, and <a href="https://arxiv.org/abs/2603.12201">a paper on IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse</a>.</p><p>You may also enjoy <a href="https://mistral.ai/news/voxtral-tts">Mistral AI&#8217;s Voxtral</a>, <a href="https://www.philschmid.de/kimi-composer-context">How Kimi, Cursor, and Chroma Train Agentic Models with RL</a>, <a href="https://arxiv.org/abs/2603.29620">a paper on Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-live/">Gemini 3.1 Flash Live: Making audio AI more natural and reliable</a></strong></p><p>Google launches Gemini 3.1 Flash Live, its highest-quality real-time audio model, scoring 90.8% on ComplexFuncBench Audio and 36.1% on AudioMultiChallenge.</p><p><strong><a href="https://mistral.ai/news/voxtral-tts">Speaking of Voxtral</a></strong></p><p>Mistral launches Voxtral TTS, a 4B-parameter multilingual text-to-speech model supporting 9 languages with 70ms latency, voice cloning from 3-second samples, and more.</p><p><strong><a href="https://ai.meta.com/blog/segment-anything-model-3/">SAM 3.1: Faster and More Accessible Real-Time Video Detection and Tracking With Multiplexing and Global Reasoning</a></strong></p><p>Meta updates SAM 3 to SAM 3.1, adding object multiplexing to double video processing speed to 32 FPS on a single H100 for its open-source text-prompted segmentation and 
tracking model.</p><p><strong><a href="https://cohere.com/blog/transcribe">Cohere Transcribe: state-of-the-art speech recognition</a></strong></p><p>Cohere launches Transcribe, a 2B-parameter open-source ASR model that tops the HuggingFace Open ASR Leaderboard with a 5.42% average word error rate across 14 languages.</p><p><strong><a href="https://siliconangle.com/2026/03/25/granola-raises-125m-1-5b-valuation-ai-note-taking-app/">Granola raises $125M at $1.5B valuation for its AI note-taking app</a></strong></p><p>Granola raises $125M Series C at a $1.5B valuation led by Index Ventures, following a quarter of 250% revenue growth, with plans to expand its AI meeting notes app toward agentic task automation.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://developer.nvidia.com/blog/deploying-disaggregated-llm-inference-workloads-on-kubernetes/">Deploying Disaggregated LLM Inference Workloads on Kubernetes</a></strong></p><p>A technical guide to deploying disaggregated LLM inference (separate prefill, decode, and router services) on Kubernetes using NVIDIA Grove, KAI Scheduler, and NVIDIA Dynamo.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://zilliz.com/blog/choose-embedding-model-rag-2026">Best Embedding Model for RAG 2026: 10 Models Compared</a></strong></p><p>A practical benchmarking guide comparing 10 embedding models across four production-critical RAG dimensions &#8212; cross-modal, cross-lingual, long-document retrieval, and MRL compression.</p><p><strong><a href="https://www.philschmid.de/kimi-composer-context">How Kimi, Cursor, and Chroma Train Agentic Models with RL</a></strong></p><p>A technical synthesis of three recent agentic RL training reports &#8212; Kimi K2.5, Cursor Composer 2, and Chroma Context-1 &#8212; distilling shared patterns around production-environment training, context management, and reward design.</p><p><strong><a href="https://weaviate.io/blog/multimodal-guide">Multimodal Embeddings and RAG: A 
Practical Guide</a></strong></p><p>A practical guide to multimodal embeddings and RAG covering the core theory (contrastive learning, modality gap, MRL), three concrete build patterns (audio, PDF, video), and when multimodal actually outperforms text-only pipelines.</p><p><strong><a href="https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference">Five techniques to reach the efficient frontier of LLM inference</a></strong></p><p>A practical guide to LLM inference optimization framed around the &#8220;efficient frontier&#8221; concept &#8212; five techniques that move production systems toward the latency/throughput Pareto boundary without additional hardware spend.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/katanemo/plano">katanemo/plano</a></strong></p><p>Plano is an AI-native proxy and data plane for agentic apps &#8212; with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agent&#8217;s core logic.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.29620">Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</a></strong></p><p><strong>Abstract:</strong></p><p>Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. 
Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.</p><p><strong><a href="https://arxiv.org/abs/2603.12201">IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse</a></strong></p><p><strong>Abstract:</strong></p><p>Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L<sup>2</sup>) to O(Lk). 
However, the indexer itself retains O(L<sup>2</sup>) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer&#8217;s top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82&#215; prefill speedup and 1.48&#215; decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 448]]></title><description><![CDATA[Cursor's Composer 2, TurboQuant: Redefining AI efficiency with extreme compression, a paper on Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-448</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-448</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 26 Mar 2026 15:03:09 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://cursor.com/blog/composer-2">Cursor&#8217;s Composer 2</a>, <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant: Redefining AI efficiency with extreme compression</a> and <a href="https://arxiv.org/abs/2602.02007">a paper on Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation</a>.</p><p>You may also enjoy <a href="https://www.anthropic.com/features/81k-interviews">What 81,000 people want from AI \ Anthropic</a>, <a href="https://opensearch.org/blog/evaluating-agentic-search-in-opensearch/">Evaluating agentic search in OpenSearch</a>, <a href="https://arxiv.org/abs/2603.20278">a paper on OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research 
Trajectory Synthesis</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://cursor.com/blog/composer-2">Introducing Composer 2 &#183; Cursor</a></strong></p><p>Cursor launches Composer 2, a frontier-level coding model trained via continued pretraining and long-horizon RL that scores 61.3 on CursorBench and 73.7 on SWE-bench Multilingual.</p><p><strong><a href="https://www.anthropic.com/features/81k-interviews">What 81,000 people want from AI \ Anthropic</a></strong></p><p>Anthropic&#8217;s largest-ever qualitative study &#8212; 80,508 Claude users across 159 countries and 70 languages &#8212; reveals what people want from AI, what they&#8217;ve already gotten, and what they fear.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/ai/lyria-3-pro/">Lyria 3 Pro: Create longer tracks in more Google products</a></strong></p><p>Google launches Lyria 3 Pro, an upgraded music generation model that produces tracks up to 3 minutes with structural song awareness (intros, verses, choruses, bridges).</p><p><strong><a href="https://allenai.org/blog/molmoweb">MolmoWeb: An open agent for automating web tasks</a></strong></p><p>Allen AI releases MolmoWeb, a fully open visual web agent built on Molmo 2 that scores 78.2% on WebVoyager, outperforming GPT-4o-based agents while releasing all weights, training data, and evaluation tools.</p><p><strong><a href="https://openai.com/index/openai-to-acquire-astral/">OpenAI to acquire Astral</a></strong></p><p>OpenAI acquires Astral &#8212; maker of Python developer tools uv, Ruff, and ty used by millions of developers &#8212; to deepen its Codex ecosystem.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a 
href="https://medium.com/pinterest-engineering/building-an-mcp-ecosystem-at-pinterest-d881eb4c16f1">Building an MCP Ecosystem at Pinterest</a></strong></p><p>Pinterest Engineering details how they scaled MCP from concept to a production ecosystem of domain-specific servers &#8212; Presto, Spark, Knowledge &#8212; with a central registry, two-layer auth, and 66,000 monthly invocations saving an estimated 7,000 engineer-hours per month.</p><p><strong><a href="https://cursor.com/blog/self-hosted-cloud-agents">Run cloud agents in your own infrastructure</a></strong></p><p>Cursor launches self-hosted cloud agents GA, keeping code and tool execution entirely within enterprise infrastructure while Cursor handles orchestration and inference.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant: Redefining AI efficiency with extreme compression</a></strong></p><p>Google Research releases TurboQuant, a KV cache quantization method that achieves 6x+ memory reduction to 3 bits with zero accuracy loss and 8x attention speedup on H100s.</p><p><strong><a href="https://opensearch.org/blog/evaluating-agentic-search-in-opensearch/">Evaluating agentic search in OpenSearch</a></strong></p><p>A technical deep-dive on how OpenSearch benchmarked its agentic search feature across search relevance (BEIR and BRIGHT datasets) and query execution accuracy (Spider dataset), powered by Claude Opus 4.6.</p><p><strong><a href="https://blog.skypilot.co/scaling-autoresearch/">Scaling Karpathy&#8217;s Autoresearch: What Happens When the Agent Gets a GPU Cluster</a></strong></p><p>A technical blog post on how SkyPilot scaled Karpathy&#8217;s autoresearch agent from 1 to 16 GPUs, enabling ~910 experiments in 8 hours.</p><p><strong><a href="https://blog.bytebytego.com/p/how-anthropics-claude-thinks">How Anthropic&#8217;s Claude Thinks - ByteByteGo Newsletter</a></strong></p><p>ByteByteGo breaks 
down Anthropic&#8217;s interpretability research into six concrete findings about how Claude actually thinks &#8212; from parallel math strategies to ahead-of-time poetry planning to a default-refusal circuit that misfires into hallucinations.</p><p><strong><a href="https://magazine.sebastianraschka.com/p/visual-attention-variants">A Visual Guide to Attention Variants in Modern LLMs</a></strong></p><p>A visual reference guide mapping seven attention variants &#8212; MHA, GQA, MLA, SWA, DeepSeek Sparse Attention, Gated Attention, and hybrid architectures &#8212; across the open-weight models currently using them in production.</p><p><strong><a href="https://cursor.com/blog/fast-regex-search">Fast regex search: indexing text for agent tools</a></strong></p><p>A technical deep-dive on how Cursor built a local sparse n-gram index to replace ripgrep for agent search &#8212; eliminating 15+ second grep latency in large monorepos by narrowing regex matches to a pre-filtered candidate set before full scanning.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/openai/teen-safety-policy-pack?tab=readme-ov-file">openai/teen-safety-policy-pack</a></strong></p><p>A set of prompt-based safety policies designed to create age-appropriate protections for teens.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.20278">OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis</a></strong></p><p><strong>Abstract:</strong></p><p>Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. 
However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy.</p><p><strong><a href="https://arxiv.org/abs/2602.02007">Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation</a></strong></p><p><strong>Abstract:</strong></p><p>Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-k similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. 
We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity&#8211;semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader&#8217;s uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 447]]></title><description><![CDATA[Mamba-3, Agent-native Architectures: How to Build Apps After Code Ends, a paper on Attention Residuals, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-447</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-447</guid><dc:creator><![CDATA[Deep Learning Weekly]]></dc:creator><pubDate>Thu, 19 Mar 2026 15:31:00 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.together.ai/blog/mamba-3">Mamba-3</a>, <a href="https://every.to/guides/agent-native">Agent-native Architectures: How to Build Apps After Code Ends</a> and <a href="https://arxiv.org/abs/2603.15031">a paper on Attention Residuals</a>.</p><p>You may also enjoy <a href="https://mistral.ai/news/mistral-small-4">Introducing Mistral Small 4</a>, <a href="https://aweers.de/blog/2026/rl-for-llms/">State of RL for reasoning LLMs</a>, <a href="https://arxiv.org/abs/2602.04261">a paper on Data Agents: Levels, State of the Art, and Open Problems</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.together.ai/blog/mamba-3">Mamba-3</a></strong></p><p>Together AI releases Mamba-3, an inference-first state space model that outperforms Mamba-2, Gated DeltaNet, and Transformer-based Llama-3.2-1B on end-to-end latency at the 1.5B scale.</p><p><strong><a href="https://mistral.ai/news/mistral-small-4">Introducing Mistral Small 4</a></strong></p><p>Mistral releases Small 4 &#8212; an open-source, 119B-parameter MoE model unifying reasoning, multimodal, and coding capabilities, delivering 40% lower latency and 3x higher throughput than its predecessor.</p><p><strong><a href="https://claude.com/blog/claude-builds-visuals">Claude builds interactive visuals right in your conversation</a></strong></p><p>Anthropic launches inline interactive charts, diagrams, and visualizations in Claude chat &#8212; available in beta across all plan types.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/measuring-agi-cognitive-framework/">Measuring Progress Towards AGI: A Cognitive Framework</a></strong></p><p>Google DeepMind releases a cognitive taxonomy paper proposing 10 human-grounded abilities to measure AGI progress, paired with a $200,000 Kaggle hackathon to crowdsource the missing benchmarks.</p><p><strong><a href="https://siliconangle.com/2026/03/13/gumloop-reels-50m-ai-automation-platform/">Gumloop reels in $50M for its AI automation platform</a></strong></p><p>Gumloop raises $50M Series B led by Benchmark &#8212; with participation from Shopify Ventures and Y Combinator &#8212; bringing total funding to $70M for its no-code, drag-and-drop AI agent automation platform.</p><p><strong><a 
href="https://siliconangle.com/2026/03/16/okta-unveils-new-framework-manage-ai-agents-upcoming-okta-ai-agents-platform/">Okta unveils new framework to manage AI agents and upcoming Okta for AI Agents platform</a></strong></p><p>Okta unveils a security blueprint for the agentic enterprise and announces its &#8220;Okta for AI Agents&#8221; platform &#8211; treating AI agents as governed, non-human identities with centralized access control and a kill switch for rogue agents.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://every.to/guides/agent-native">Agent-native Architectures: How to Build Apps After Code Ends</a></strong></p><p>A technical guide on building agent-native applications &#8212; software architectures where agents are first-class citizens, using atomic tools and outcome-driven loops instead of hardcoded workflows.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://aweers.de/blog/2026/rl-for-llms/">State of RL for reasoning LLMs</a></strong></p><p>A technical deep-dive surveying the evolution of reinforcement learning algorithms for reasoning LLMs (2024&#8211;2026), tracing the lineage from REINFORCE and PPO through GRPO and eight successor methods.</p><p><strong><a href="https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/">Many SWE-bench-Passing PRs Would Not Be Merged into Main - METR</a></strong></p><p>METR researchers found that roughly half of SWE-bench Verified PRs that pass the automated grader would not actually be merged by real repo maintainers, with automated grader scores averaging 24 percentage points higher than maintainer merge rates.</p><p><strong><a href="https://blog.ml.cmu.edu/2026/03/17/lumberchunker-long-form-narrative-document-segmentation/">LumberChunker: Long-Form Narrative Document Segmentation</a></strong></p><p>An article about LumberChunker, a RAG chunking method that uses an LLM to detect semantic boundaries in long-form narrative documents, achieving 
DCG@20 of 62.1% on the GutenQA benchmark &#8212; outperforming all fixed-size and recursive baselines.</p><p><strong><a href="https://ai.stanford.edu/blog/vagen/">VAGEN: Teaching Vision-Language Models to Build World Models Through Reinforcement Learning</a></strong></p><p>A Stanford AI Lab research blog post about VAGEN, a reinforcement learning framework that trains 3B-parameter VLM agents to build internal world models via structured state estimation and transition predictions.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/tobi/qmd">tobi/qmd</a></strong></p><p>A mini CLI search engine for your docs, knowledge bases, meeting notes, and more.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.15031">Attention Residuals</a></strong></p><p><strong>Abstract:</strong></p><p>Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer&#8217;s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. 
Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.</p><p>Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.</p><p><strong><a href="https://arxiv.org/abs/2602.04261">Data Agents: Levels, State of the Art, and Open Problems</a></strong></p><p><strong>Abstract:</strong></p><p>Data agents are an emerging paradigm that leverages large language models (LLMs) and tool-using agents to automate data management, preparation, and analysis tasks. However, the term &#8220;data agent&#8221; is currently used inconsistently, conflating simple query responsive assistants with aspirational fully autonomous &#8220;data scientists&#8221;. This ambiguity blurs capability boundaries and accountability, making it difficult for users, system builders, and regulators to reason about what a &#8220;data agent&#8221; can and cannot do.</p><p>In this tutorial, we propose the first hierarchical taxonomy of data agents from Level 0 (L0, no autonomy) to Level 5 (L5, full autonomy). Building on this taxonomy, we will introduce a lifecycle- and level-driven view of data agents. 
We will (1) present the L0-L5 taxonomy and the key evolutionary leaps that separate simple assistants from truly autonomous data agents, (2) review representative L0-L2 systems across data management, preparation, and analysis, (3) highlight emerging Proto-L3 systems that strive to autonomously orchestrate end-to-end data workflows to tackle diverse and comprehensive data-related tasks under supervision, and (4) discuss forward-looking research challenges towards proactive (L4) and generative (L5) data agents. We aim to offer both a practical map of today&#8217;s systems and a research roadmap for the next decade of data-agent development.</p><p><strong><a href="https://arxiv.org/abs/2603.14473">AI Can Learn Scientific Taste</a></strong></p><p><strong>Abstract:</strong></p><p>Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist&#8217;s executive capability, while enhancing an AI&#8217;s scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. 
Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 446]]></title><description><![CDATA[Native Observability & Alerts for Your OpenClaw with Opik, Gemini Embedding 2, a paper on LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-446</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-446</guid><dc:creator><![CDATA[Deep Learning Weekly]]></dc:creator><pubDate>Thu, 12 Mar 2026 15:02:50 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/openclaw-observability/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=openclaw-observability/">Native Observability &amp; Alerts for Your OpenClaw with Opik</a>, <a 
href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/">Gemini Embedding 2</a>, and <a href="https://arxiv.org/abs/2603.03269">a paper on LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory</a>.</p><p>You may also enjoy <a href="https://openai.com/index/introducing-gpt-5-4/">GPT-5.4</a>, <a href="https://arxiv.org/abs/2601.18137">a paper on DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/">Gemini Embedding 2: Our first natively multimodal embedding model</a></strong></p><p>Google launches Gemini Embedding 2, its first natively multimodal embedding model unifying text, images, video, audio, and documents into a single semantic space.</p><p><strong><a href="https://openai.com/index/openai-to-acquire-promptfoo/">OpenAI to acquire Promptfoo</a></strong></p><p>OpenAI is acquiring Promptfoo &#8212; an AI security platform used by 25%+ of Fortune 500 companies &#8212; to embed red-teaming, jailbreak detection, and agentic risk evaluation natively into its enterprise Frontier platform.</p><p><strong><a href="https://openai.com/index/introducing-gpt-5-4/">Introducing GPT-5.4 | OpenAI</a></strong></p><p>OpenAI launches GPT-5.4 with a 1M-token context, new Tool Search API, and record scores on coding and knowledge-work benchmarks &#8212; its most capable frontier model for professional and agentic use.</p><p><strong><a href="https://venturebeat.com/orchestration/google-upgrades-gemini-for-workspace-allowing-it-to-pull-data-from-multiple">Google upgrades Gemini for Workspace allowing it to pull data 
from multiple apps to create Docs, Sheets, Slides and more</a></strong></p><p>Google lets Gemini generate fully-formed Docs, Sheets, and Slides by pulling from Gmail, Drive, and Chat &#8212; turning Workspace into a single-prompt content creation engine.</p><p><strong><a href="https://techcrunch.com/2026/03/09/yann-lecuns-ami-labs-raises-1-03-billion-to-build-world-models/">Yann LeCun&#8217;s AMI Labs raises $1.03B to build world models | TechCrunch</a></strong></p><p>Yann LeCun&#8217;s AMI Labs raises $1.03B at a $3.5B valuation to build JEPA-based world models &#8212; AI that learns from reality rather than language &#8212; with NVIDIA, Samsung, and Eric Schmidt among backers.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/openclaw-observability/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=openclaw-observability/">Native Observability &amp; Alerts for Your OpenClaw with Opik</a></strong></p><p>A blog post announcing opik-openclaw, a native OpenClaw plugin from Comet that adds full-stack observability &#8212; tracing every LLM call, tool execution, token cost, and sub-agent delegation &#8212; to address the visibility gap in autonomous agent workflows.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://openai.com/index/instruction-hierarchy-challenge/">Improving instruction hierarchy in frontier LLMs</a></strong></p><p>A technical research post about OpenAI&#8217;s IH-Challenge &#8212; an RL training dataset that teaches models a strict trust hierarchy (System &gt; Developer &gt; User &gt; Tool) to resist prompt injection, jailbreaks, and instruction conflicts.</p><p><strong><a href="https://huggingface.co/blog/nvidia/synthetic-code-concepts">Code Concepts: A Large-Scale Synthetic Dataset Generated from Programming Concept Seeds</a></strong></p><p>A technical blog post about NVIDIA&#8217;s concept-driven synthetic data pipeline that generated 15M Python programming 
problems, yielding a 6-point HumanEval gain (73&#8594;79) when included in Nemotron-Nano-v3 pretraining.</p><p><strong><a href="https://www.philschmid.de/testing-skills#1-create-a-prompt-set">Practical Guide to Evaluating and Testing Agent Skills</a></strong></p><p>A practical guide about building lightweight eval harnesses for agent skills, walking through how to define success criteria, construct prompt sets, and iterate &#8212; illustrated by taking a Gemini Interactions API skill from 66.7% to 100% pass rate.</p><p><strong><a href="https://ai.stanford.edu/blog/tos_stanford_blog/">Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?</a></strong></p><p>A Stanford benchmark revealing that frontier models (GPT-5.2, Gemini-3 Pro, Claude 4.5 Sonnet) all fail to build accurate, revisable cognitive maps during active spatial exploration &#8212; humans consistently outperform all of them.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/alibaba/page-agent">alibaba/page-agent</a></strong></p><p>JavaScript in-page GUI agent. Control web interfaces with natural language.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2603.03269">LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory</a></strong></p><p><strong>Abstract:</strong></p><p>Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. 
We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust, globally consistent reconstruction over unprecedented horizons.</p><p><strong><a href="https://arxiv.org/abs/2601.18137">DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints</a></strong></p><p><strong>Abstract:</strong></p><p>While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. 
It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 445]]></title><description><![CDATA[Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems, Nano Banana 2: Combining Pro capabilities with lightning-fast speed, a paper on Beyond Language Modeling: An]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-445</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-445</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 05 Mar 2026 16:03:02 GMT</pubDate><enclosure 
url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/opik-claude-code-plugin/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=opik-claude-code-plugin/">Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems</a>, <a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/">Nano Banana 2: Combining Pro capabilities with lightning-fast speed</a> and <a href="https://arxiv.org/abs/2603.03276">a paper on Beyond Language Modeling: An Exploration of Multimodal Pretraining</a>.</p><p>You may also enjoy <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/">Gemini 3.1 Flash-Lite</a>, <a href="https://news.mit.edu/2026/personalization-features-can-make-llms-more-agreeable-0218">Personalization features can make LLMs more agreeable</a>, <a href="https://arxiv.org/abs/2602.22661">a paper on dLLM: Simple Diffusion Language Modeling</a>, and more!</p><p>As always, happy reading and hacking. 
If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/">Gemini 3.1 Flash-Lite: Built for intelligence at scale</a></strong></p><p>Google launches Gemini 3.1 Flash-Lite in preview, positioning it as their fastest and most cost-efficient model yet at $0.25/1M input tokens &#8212; built specifically for high-volume developer workloads demanding both speed and reasoning.</p><p><strong><a href="https://openai.com/index/gpt-5-3-instant/">GPT-5.3 Instant: Smoother, more useful everyday conversations</a></strong></p><p>OpenAI releases GPT-5.3 Instant as the new default ChatGPT model, cutting hallucinations by up to 26.8% and dramatically reducing the over-cautious, &#8220;cringe&#8221; responses that frustrated everyday users.</p><p><strong><a href="https://www.anthropic.com/news/statement-department-of-war">Statement from Dario Amodei on our discussions with the Department of War</a></strong></p><p>Anthropic&#8217;s Dario Amodei publicly refuses Department of War demands to remove AI safeguards on mass domestic surveillance and fully autonomous weapons.</p><p><strong><a href="https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in">Did Alibaba just kneecap its powerful Qwen AI team? 
Key figures depart in wake of latest open source release</a></strong></p><p>Alibaba&#8217;s Qwen AI team loses its founding technical lead and two key researchers just 24 hours after shipping the Qwen3.5 small model series, raising alarm about the project&#8217;s open-source future and triggering a 5% drop in Alibaba&#8217;s stock.</p><p><strong><a href="https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/">Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model</a></strong></p><p>Microsoft releases Phi-4-reasoning-vision-15B, a compact open-weight multimodal model that rivals much larger models on math, science, and computer-use tasks while requiring a fraction of the training compute.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/">Nano Banana 2: Combining Pro capabilities with lightning-fast speed</a></strong></p><p>Google launches Nano Banana 2 (Gemini 3.1 Flash Image), combining the advanced quality of Nano Banana Pro with Flash-level speed, rolling out across Gemini, Search, Google Ads, Vertex AI, and Flow.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/opik-claude-code-plugin/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=opik-claude-code-plugin/">Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems</a></strong></p><p>Announcing the new Opik Claude Code Plugin, which automatically instruments Python and JavaScript agent code with tracing, applies observability best practices, and logs what Claude Code is doing as it modifies a system.</p><p><strong><a href="https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud">Improve chatbot memory using Google Cloud</a></strong></p><p>A practical guide about building scalable long-term memory for agentic chatbots using a 
three-tier polyglot storage architecture on Google Cloud (Redis, Bigtable, BigQuery).</p><h2><strong>Learning</strong></h2><p><strong><a href="https://news.mit.edu/2026/personalization-features-can-make-llms-more-agreeable-0218">Personalization features can make LLMs more agreeable</a></strong></p><p>MIT/Penn State research finds LLM personalization features significantly amplify sycophantic behavior, with memory-stored user profiles having the greatest effect across 4 of 5 models tested in real two-week user interactions.</p><p><strong><a href="https://montrealethics.ai/tech-futures-the-threat-to-digital-infrastructure/">The threat of AI-generated code to the world&#8217;s digital infrastructure</a></strong></p><p>An article about how AI-enabled &#8220;vibe contributing&#8221; &#8212; low-quality, AI-generated code submitted by novice contributors &#8212; is overwhelming volunteer open source maintainers and threatening the stability of global digital infrastructure.</p><p><strong><a href="https://research.google/blog/teaching-llms-to-reason-like-bayesians/">Teaching LLMs to reason like Bayesians</a></strong></p><p>A research blog post about how Google trained LLMs to reason like optimal Bayesian agents via fine-tuning on Bayesian model outputs, dramatically improving probabilistic belief-updating across domains.</p><p><strong><a href="https://huggingface.co/blog/moe-transformers">Mixture of Experts (MoEs) in Transformers</a></strong></p><p>A technical blog post about how Hugging Face redesigned the transformers library to make Mixture-of-Experts (MoE) models first-class citizens, covering weight loading, expert routing backends, parallelism, and training optimizations.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated 
evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/pydantic/monty">pydantic/monty</a></strong></p><p>A minimal, secure Python interpreter written in Rust for use by AI.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.22661">dLLM: Simple Diffusion Language Modeling</a></strong></p><p><strong>Abstract:</strong></p><p>Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures.</p><p>To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.</p><p><strong><a href="https://arxiv.org/abs/2603.03276">Beyond Language Modeling: An Exploration of Multimodal Pretraining</a></strong></p><p><strong>Abstract:</strong></p><p>The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. 
We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 444]]></title><description><![CDATA[Gemini 3.1 Pro, A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026, a paper on Does Your Reasoning Model Implicitly Know When to Stop Thinking?, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-444</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-444</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 26 Feb 2026 16:01:52 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/">Gemini 3.1 Pro</a>, <a href="https://magazine.sebastianraschka.com/p/a-dream-of-spring-for-open-weight">A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026</a> and <a href="https://arxiv.org/abs/2602.08354">a paper on Does Your Reasoning Model Implicitly Know When to Stop Thinking?</a>.</p><p>You may also enjoy <a href="https://venturebeat.com/orchestration/anthropic-just-released-a-mobile-version-of-claude-code-called-remote">Anthropic&#8217;s Remote Control</a>, <a href="https://posthog.com/blog/optimizing-agent-cost">How we caught our AI agent embezzling tokens</a>, <a href="https://arxiv.org/abs/2602.21193">a paper on On Data 
Engineering for Scaling LLM Terminal Capabilities</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/">Gemini 3.1 Pro: A smarter model for your most complex tasks</a></strong></p><p>Google launches Gemini 3.1 Pro, claiming more than double the reasoning performance of its predecessor on complex logic benchmarks, now rolling out across developer, enterprise, and consumer products.</p><p><strong><a href="https://venturebeat.com/orchestration/anthropic-just-released-a-mobile-version-of-claude-code-called-remote">Anthropic just released a mobile version of Claude Code called Remote Control</a></strong></p><p>Anthropic launches Claude Code Remote Control, a new feature enabling developers to initiate coding sessions on their local terminal and seamlessly continue them from any mobile device or browser without moving code to the cloud.</p><p><strong><a href="https://cohere.com/blog/cohere-labs-tiny-aya">Cohere Labs Launches Tiny Aya, Making Multilingual AI Accessible</a></strong></p><p>Cohere Labs releases Tiny Aya, a 3.35B open-weight model claiming top multilingual performance in its size class across region-specific language variants.</p><p><strong><a href="https://cursor.com/blog/agent-computer-use">Cursor agents can now control their own computers</a></strong></p><p>Cursor launches cloud agents that run in isolated VMs with full computer-use capabilities, producing merge-ready PRs with video/screenshot artifacts to validate their work across web, mobile, Slack, and GitHub.</p><p><strong><a href="https://venturebeat.com/orchestration/visual-imitation-learning-guidde-trains-ai-agents-on-human-expert-video">Visual imitation learning: 
Guidde trains AI agents on human &#8216;expert video&#8217; instead of documentation</a></strong></p><p>Guidde raises $50M to train AI agents on expert screen-recording videos instead of static documentation, cutting video creation time by 41% and support tickets by 34%.</p><p><strong><a href="https://techcrunch.com/2026/02/25/the-public-opposition-to-ai-infrastructure-is-heating-up/">The public opposition to AI infrastructure is heating up</a></strong></p><p>Bipartisan opposition to AI data centers is escalating across the U.S., with states like New York proposing three-year construction moratoriums and communities pulling tax incentives, even as Big Tech commits $650B in infrastructure spending.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://posthog.com/blog/optimizing-agent-cost">How we caught our AI agent embezzling tokens</a></strong></p><p>A PostHog engineering deep-dive into how they traced, diagnosed, and reduced their AI Wizard agent&#8217;s $6.67/run inference cost &#8212; uncovering three &#8220;token embezzlement&#8221; patterns and counterintuitive findings about context management and caching.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://magazine.sebastianraschka.com/p/a-dream-of-spring-for-open-weight">A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026</a></strong></p><p>A comprehensive architectural deep-dive comparing 10 major open-weight LLM releases from January&#8211;February 2026, highlighting the convergence toward hybrid attention mechanisms and efficiency-first design across models ranging from 3B to 1T parameters.</p><p><strong><a href="https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d">MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix</a></strong></p><p>An engineering blog post about how Netflix built MediaFM, its first in-house tri-modal (audio, video, text) foundation model trained on 
tens of millions of catalog shots to power recommendations, ad relevancy, and promotional asset optimization at scale.</p><p><strong><a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">Detecting and preventing distillation attacks \ Anthropic</a></strong></p><p>Anthropic exposes three Chinese AI labs &#8212; DeepSeek, Moonshot, and MiniMax &#8212; for running industrial-scale &#8220;distillation attacks&#8221; that illicitly extracted Claude&#8217;s capabilities across 16M+ exchanges through ~24,000 fraudulent accounts.</p><p><strong><a href="https://epoch.ai/blog/expanding-our-analysis-of-biological-ai-models">Expanding our analysis of biological AI models | Epoch AI</a></strong></p><p>A comprehensive Epoch AI report, commissioned by Sentinel Bio, cataloging 1,196 biological AI models across nine categories and revealing critical biosafety gaps and landscape trends.</p><p><strong><a href="https://research.google/blog/teaching-ai-to-read-a-map/">Teaching AI to read a map</a></strong></p><p>Google Research introduces MapTrace, a fully automated synthetic data pipeline using Gemini and Imagen models to generate 2M annotated map path examples &#8212; teaching multimodal LLMs fine-grained spatial reasoning and reducing path-tracing error by 33% on real-world benchmarks.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/vxcontrol/pentagi">vxcontrol/pentagi</a></strong></p><p>A fully autonomous AI agent system capable of performing complex penetration testing tasks.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.08354">Does Your Reasoning Model Implicitly Know 
When to Stop Thinking?</a></strong></p><p><strong>Abstract:</strong></p><p>Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.</p><p><strong><a href="https://arxiv.org/abs/2602.21193">On Data Engineering for Scaling LLM Terminal Capabilities</a></strong></p><p><strong>Abstract:</strong></p><p>Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. 
We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 443]]></title><description><![CDATA[Optimizing AI IDEs at Scale, What do &#8220;economic value&#8221; benchmarks tell us, a paper on MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-443</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-443</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 19 Feb 2026 17:03:10 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.comet.com/site/blog/optimize-ai-ide-cost/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=optimize-ai-ide-cost/">Optimizing AI IDEs at Scale</a>, <a href="https://epoch.ai/blog/what-do-economic-value-benchmarks-tell-us">What do &#8220;economic value&#8221; benchmarks tell us?</a> and <a href="https://arxiv.org/abs/2602.02474">a paper on MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents</a>.</p><p>You may also enjoy <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/">Gemini 3 Deep Think: Advancing science, research and engineering</a>, <a href="https://huggingface.co/blog/openenv-turing">OpenEnv in Practice: Evaluating 
Tool-Using Agents in Real-World Environments</a>, <a href="https://openreview.net/forum?id=tq9lyV9Cml">a paper on Thought Communication in Multiagent Collaboration</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.anthropic.com/news/claude-sonnet-4-6">Introducing Claude Sonnet 4.6</a></strong></p><p>Anthropic launches Claude Sonnet 4.6 as the new default model across all plans, featuring a 1M token context window, major computer use improvements, and Opus-level performance on many tasks at the same $3/$15 per million token price as Sonnet 4.5.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/">Gemini 3 Deep Think: Advancing science, research and engineering</a></strong></p><p>Google announces a major upgrade to Gemini 3 Deep Think, its specialized reasoning mode targeting frontier science, math, and engineering &#8212; setting new benchmark records and opening early API access to researchers and enterprises.</p><p><strong><a href="https://www.cnbc.com/2026/02/17/china-alibaba-qwen-ai-agent-latest-model.html">Alibaba unveils Qwen3.5 as China&#8217;s chatbot race shifts to AI agents</a></strong></p><p>Alibaba launches Qwen 3.5 &#8212; a 397B-parameter, natively multimodal open-weight model built for agentic AI &#8212; as China&#8217;s frontier model race intensifies ahead of an expected DeepSeek release.</p><p><strong><a href="https://siliconangle.com/2026/02/17/ai-agent-reliability-startup-temporal-raises-300m-funding/">AI agent reliability startup Temporal raises $300M in funding</a></strong></p><p>Temporal raises $300M Series D at a $5B valuation, led by a16z, to scale its open-source platform that makes AI agents fault-tolerant by logging 
every action and enabling automatic recovery from failures.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.comet.com/site/blog/optimize-ai-ide-cost/?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=dlw&amp;utm_content=optimize-ai-ide-cost/">Optimizing AI IDEs at Scale</a></strong></p><p>A blog post detailing how Comet&#8217;s engineering team traced rising AI IDE spend to bloated context windows and always-on agent rules, then reduced token overhead by shrinking default context, modularizing skills, and tightening evaluation loops.</p><p><strong><a href="https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194">Scaling LLM Post-Training at Netflix</a></strong></p><p>A technical blog post about how Netflix built an internal LLM post-training framework using Ray-based distributed orchestration to scale fine-tuning and RL workflows across multi-node GPU clusters for recommendation, search, and personalization.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://huggingface.co/blog/openenv-turing">OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments</a></strong></p><p>A technical blog post about OpenEnv, an open-source agent evaluation framework, and findings from testing tool-using agents in a production-grade calendar benchmark &#8212; revealing that ambiguity and multi-step chaining, not tool selection, are the primary failure modes.</p><p><strong><a href="https://www.seangoedecke.com/fast-llm-inference/">Two different tricks for fast LLM inference</a></strong></p><p>A technical blog post comparing Anthropic&#8217;s and OpenAI&#8217;s &#8220;fast mode&#8221; inference approaches &#8212; low-batch-size serving vs. 
Cerebras wafer-scale chips &#8212; and arguing that accuracy, not raw speed, remains the dominant factor in agentic AI value.</p><p><strong><a href="https://milvus.io/blog/we-extracted-openclaws-memory-system-and-opensourced-it-memsearch.md">We Extracted OpenClaw&#8217;s Memory System and Open-Sourced It (memsearch)</a></strong></p><p>A technical blog post about how Zilliz extracted OpenClaw&#8217;s transparent, Markdown-based long-term memory architecture and open-sourced it as memsearch &#8212; a standalone, framework-agnostic memory library backed by Milvus vector search.</p><p><strong><a href="https://epoch.ai/blog/what-do-economic-value-benchmarks-tell-us">What do &#8220;economic value&#8221; benchmarks tell us? | Epoch AI</a></strong></p><p>A research report analyzing three &#8220;economic value&#8221; benchmarks that measure AI performance on real-world digital work tasks, concluding that high scores signal meaningful task-level acceleration but fall short of implying end-to-end job automation.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/vercel-labs/json-render">vercel-labs/json-render</a></strong></p><p>json-render is a Generative UI framework: AI generates interfaces from natural language prompts, constrained to components you define.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.02474">MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents</a></strong></p><p><strong>Abstract:</strong></p><p>Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. 
These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present <strong>MemSkill</strong>, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a <em>controller</em> that learns to select a small set of relevant skills, paired with an LLM-based <em>executor</em> that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a <em>designer</em> that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.</p><p><strong><a href="https://openreview.net/forum?id=tq9lyV9Cml">Thought Communication in Multiagent Collaboration</a></strong></p><p><strong>Abstract:</strong></p><p>Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy. 
To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.deeplearningweekly.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 442]]></title><description><![CDATA[Claude Opus 4.6, Harness engineering: leveraging Codex in an agent-first world, a paper on Weak-Driven Learning: How Weak Agents make Strong Agents Stronger, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-442</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-442</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 12 Feb 2026 16:02:28 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://www.anthropic.com/news/claude-opus-4-6">Claude Opus 4.6</a>, <a href="https://openai.com/index/harness-engineering/">Harness engineering: leveraging Codex in an agent-first world</a> and <a href="https://arxiv.org/abs/2602.08222">a paper on Weak-Driven Learning: How Weak Agents make Strong Agents Stronger</a>.</p><p>You may also enjoy <a href="https://openai.com/index/introducing-gpt-5-3-codex/">GPT-5.3-Codex</a>, <a href="https://research.google/blog/beyond-one-on-one-authoring-simulating-and-testing-dynamic-human-ai-group-conversations/">Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations</a>, <a href="https://arxiv.org/abs/2602.08234">a paper on SkillRL: Evolving Agents via Recursive Skill-Augmented 
Reinforcement Learning</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://www.anthropic.com/news/claude-opus-4-6">Introducing Claude Opus 4.6</a></strong></p><p>Anthropic launches Claude Opus 4.6 with state-of-the-art agentic coding, a 1M-token context window, and industry-leading scores on Terminal-Bench 2.0, Humanity&#8217;s Last Exam, and GDPval-AA.</p><p><strong><a href="https://openai.com/index/introducing-gpt-5-3-codex/">Introducing GPT-5.3-Codex | OpenAI</a></strong></p><p>OpenAI launches GPT-5.3-Codex &#8212; its first self-bootstrapped model that helped debug its own training &#8212; combining GPT-5.2&#8217;s reasoning with frontier coding performance at 25% faster speeds.</p><p><strong><a href="https://siliconangle.com/2026/02/10/world-model-startup-runway-closes-315m-funding-round/">World model startup Runway closes $315M funding round</a></strong></p><p>Runway closes a $315M Series E led by General Atlantic at a $5.3B valuation, with backing from NVIDIA and AMD, to advance its world models for 3D environment generation used in robotics simulation and video production.</p><p><strong><a href="https://venturebeat.com/orchestration/openai-upgrades-its-responses-api-to-support-agent-skills-and-a-complete">OpenAI upgrades its Responses API to support agent skills and a complete terminal shell</a></strong></p><p>An article about OpenAI adding server-side compaction, hosted shell containers, and the open &#8220;Skills&#8221; standard to its Responses API, enabling agents to handle 5M+ token sessions without context degradation.</p><h2><strong>MLOps/LLMOps</strong></h2><p><strong><a href="https://www.pinecone.io/blog/millions-at-stake-melange/#The-Infrastructure-Problem-Behind-the-Accuracy-Problem">Millions at 
Stake: How Melange&#8217;s High-Recall Retrieval Prevents Litigation Collapse</a></strong></p><p>A case study about how patent analytics company Melange uses Pinecone&#8217;s vector database to achieve 99% recall across 600M+ documents, saving $75K annually while preventing million-dollar litigation risks from missed prior art.</p><p><strong><a href="https://openai.com/index/harness-engineering/">Harness engineering: leveraging Codex in an agent-first world</a></strong></p><p>An engineering post on how OpenAI built a million-line codebase with zero hand-written code using a 3-engineer team driving Codex agents at 3.5 PRs/engineer/day, redefining the developer role as harness design over direct coding.</p><p><strong><a href="https://venturebeat.com/data/observational-memory-cuts-ai-agent-costs-10x-and-outscores-rag-on-long">&#8216;Observational memory&#8217; cuts AI agent costs 10x and outscores RAG on long-context benchmarks</a></strong></p><p>An article about Mastra&#8217;s open-source &#8220;observational memory&#8221; architecture that uses Observer and Reflector agents to compress conversation history into stable, cacheable context &#8212; scoring 94.87% on LongMemEval while cutting token costs 10x versus traditional RAG.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/beyond-one-on-one-authoring-simulating-and-testing-dynamic-human-ai-group-conversations/">Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations</a></strong></p><p>A blog post on DialogLab, Google&#8217;s open-source framework for designing and testing multi-party human-AI group conversations with configurable roles, turn-taking rules, and a human-in-the-loop control mode.</p><p><strong><a href="https://milvus.io/blog/openclaw-formerly-clawdbot-moltbot-explained-a-complete-guide-to-the-autonomous-ai-agent.md">What Is OpenClaw? 
Complete Guide to the Open-Source AI Agent</a></strong></p><p>A guide to OpenClaw, the open-source, self-hosted AI agent that surpassed 175K GitHub stars in under two weeks by enabling autonomous task execution through messaging apps like WhatsApp, Telegram, and Slack.</p><p><strong><a href="https://www.anthropic.com/research/AI-assistance-coding-skills">How AI assistance impacts the formation of coding skills \ Anthropic</a></strong></p><p>A randomized controlled trial showing AI coding assistance decreased skill mastery by 17% among 52 software engineers, with debugging abilities most affected despite minimal productivity gains.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/google/A2UI">google/A2UI</a></strong></p><p>A2UI is an open-source project that allows agents to generate or populate rich user interfaces.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2602.08222">Weak-Driven Learning: How Weak Agents make Strong Agents Stronger</a></strong></p><p><strong>Abstract:</strong></p><p>As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models&#8217; own historical weak states. Motivated by this observation, we propose WMSS (Weak Agents Can Make Strong Agents Stronger), a post-training paradigm that leverages weak checkpoints to guide continued optimization. 
By identifying recoverable learning gaps via entropy dynamics and reinforcing them through compensatory learning, WMSS enables strong agents to improve beyond conventional post-training saturation. Experiments on mathematical reasoning and code generation datasets show that agents trained with our approach achieve effective performance improvements, while incurring zero additional inference cost.</p><p><strong><a href="https://arxiv.org/abs/2602.08234">SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning</a></strong></p><p><strong>Abstract:</strong></p><p>Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting high-level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library SkillBank, an adaptive retrieval strategy for general and task-specific heuristics, and a recursive evolution mechanism that allows the skill library to co-evolve with the agent&#8217;s policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. 
Experimental results on ALFWorld, WebShop and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance, outperforming strong baselines over 15.3% and maintaining robustness as task complexity increases.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 441]]></title><description><![CDATA[Qwen3-Coder-Next, Inside OpenAI&#8217;s in-house data agent, a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text, and many more!]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-441</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-441</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 05 Feb 2026 16:01:54 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://qwen.ai/blog?id=qwen3-coder-next">Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding</a>, <a 
href="https://openai.com/index/inside-our-in-house-data-agent/">Inside OpenAI&#8217;s in-house data agent</a> and <a href="https://arxiv.org/abs/2601.22975">a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text</a>.</p><p>You may also enjoy <a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/">Project Genie: Experimenting with infinite, interactive worlds</a>, <a href="https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/">Towards a science of scaling agent systems: When and why agent systems work</a>, <a href="https://arxiv.org/abs/2601.23265">a paper on PaperBanana: Automating Academic Illustration for AI Scientists</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://qwen.ai/blog?id=qwen3-coder-next">Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding</a></strong></p><p>Official blog post announcing Qwen3-Coder-Next, an 80B-parameter coding model achieving competitive performance on SWE-Bench (70.6% on Verified) while enabling 10x higher throughput for repository-level agentic workflows.</p><p><strong><a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/">Project Genie: Experimenting with infinite, interactive worlds</a></strong></p><p>Google launches Project Genie, an experimental world model powered by Genie 3 that lets Google AI Ultra subscribers create and explore infinite, interactive environments in real-time using text and image prompts.</p><p><strong><a href="https://venturebeat.com/infrastructure/vercel-rebuilt-v0-to-tackle-the-90-problem-connecting-ai-generated-code-to">Vercel rebuilt v0 to tackle the 
90% problem: Connecting AI-generated code to existing production infrastructure, not prototypes</a></strong></p><p>A news article reporting Vercel&#8217;s complete rebuild of v0 to address the &#8220;90% problem&#8221; where AI-generated code fails to integrate with existing production infrastructure.</p><p><strong><a href="https://mistral.ai/news/voxtral-transcribe-2">Voxtral transcribes at the speed of sound. | Mistral AI</a></strong></p><p>A product announcement for Mistral&#8217;s Voxtral Transcribe 2, featuring state-of-the-art speech-to-text with speaker diarization at $0.003/min and Voxtral Realtime with sub-200ms latency for live transcription.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://openai.com/index/inside-our-in-house-data-agent/">Inside OpenAI&#8217;s in-house data agent</a></strong></p><p>OpenAI&#8217;s internal data agent powered by GPT-5.2 enables natural language queries across 600+ petabytes and 70,000 datasets, using multi-layered context and self-correction to deliver trustworthy analytics in minutes.</p><p><strong><a href="https://weaviate.io/blog/limit-in-the-loop">The Limit in the Loop</a></strong></p><p>A blog post arguing AI memory requires active maintenance infrastructure with six core functions to prevent accumulated noise from degrading agent performance over time.</p><p><strong><a href="https://www.philschmid.de/acp-overview">The Agent Client Protocol Overview</a></strong></p><p>A technical overview of the Agent Client Protocol (ACP), an open JSON-RPC 2.0 standard that provides a common interface for editors to interact with AI coding agents.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/">Towards a science of scaling agent systems: When and why agent systems work</a></strong></p><p>A research article presenting Google&#8217;s evaluation of 180 agent configurations, revealing multi-agent 
systems boost parallelizable tasks by 81% but degrade sequential tasks by 70%.</p><p><strong><a href="https://www.astralcodexten.com/p/moltbook-after-the-first-weekend?hide_intro_popup=true">Moltbook: After The First Weekend - by Scott Alexander</a></strong></p><p>Scott Alexander examines whether Moltbook AI activity is &#8220;real&#8221; or &#8220;roleplay&#8221; by evaluating external causes and effects.</p><p><strong><a href="https://alignment.anthropic.com/2026/hot-mess-of-ai/">The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?</a></strong></p><p>A research article from Anthropic finding AI failures increasingly stem from incoherence rather than systematic misalignment as tasks grow harder, suggesting future risks resemble industrial accidents more than coherent goal pursuit.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/jezweb/claude-skills">jezweb/claude-skills</a></strong></p><p>A collection of skills for the Claude Code CLI covering full-stack development with Cloudflare, React, Tailwind v4, and AI integrations.</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2601.22975">Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text</a></strong></p><p><strong>Abstract:</strong></p><p>Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. 
To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.</p><p><strong><a href="https://arxiv.org/abs/2601.23265">PaperBanana: Automating Academic Illustration for AI Scientists</a></strong></p><p><strong>Abstract:</strong></p><p>Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. 
Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.</p><p><strong><a href="https://arxiv.org/abs/2602.01785">CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding</a></strong></p><p><strong>Abstract:</strong></p><p>Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. 
Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.</p>]]></content:encoded></item><item><title><![CDATA[Deep Learning Weekly: Issue 440]]></title><description><![CDATA[Terminally online Mistral Vibe, ATLAS: Practical scaling laws for multilingual models, a paper on GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization, and m]]></description><link>https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-440</link><guid isPermaLink="false">https://www.deeplearningweekly.com/p/deep-learning-weekly-issue-440</guid><dc:creator><![CDATA[Miko Planas]]></dc:creator><pubDate>Thu, 29 Jan 2026 16:02:09 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5f87e5bf-9ea0-4fec-a25d-f48477452317_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week in deep learning, we bring you <a href="https://mistral.ai/news/mistral-vibe-2-0">Terminally online Mistral Vibe.</a>, <a href="https://research.google/blog/atlas-practical-scaling-laws-for-multilingual-models/">ATLAS: Practical scaling laws for multilingual models</a> and <a href="https://arxiv.org/abs/2601.05242">a paper on GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</a>.</p><p>You may also enjoy <a href="https://siliconangle.com/2026/01/27/moonshot-ai-releases-open-source-kimi-k2-5-model-1t-parameters/">Moonshot AI releases open-source Kimi K2.5 model with 1T parameters</a>, <a href="https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151">The AI Evolution of Graph Search at 
Netflix From Structured Queries to Natural Language</a>, <a href="https://arxiv.org/abs/2601.08763">a paper on Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</a>, and more!</p><p>As always, happy reading and hacking. If you have something you think should be in next week&#8217;s issue, find us on Twitter: <a href="https://twitter.com/dl_weekly">@dl_weekly</a>.</p><p>Until next week!</p><div><hr></div><h2><strong>Industry</strong></h2><p><strong><a href="https://mistral.ai/news/mistral-vibe-2-0">Terminally online Mistral Vibe.</a></strong></p><p>Mistral launches Vibe 2.0, a terminal-native coding agent powered by Devstral 2, featuring custom subagents, multi-choice clarifications, and slash-command skills.</p><p><strong><a href="https://siliconangle.com/2026/01/27/moonshot-ai-releases-open-source-kimi-k2-5-model-1t-parameters/">Moonshot AI releases open-source Kimi K2.5 model with 1T parameters</a></strong></p><p>Moonshot AI releases open-source Kimi K2.5, a 1 trillion parameter mixture-of-experts model trained on 15 trillion tokens that outperforms GPT-5.2 on several benchmarks including the challenging HLE-Full evaluation.</p><p><strong><a href="https://techcrunch.com/2026/01/27/node-based-design-tool-flora-raises-42m-from-redpoint-ventures/">Node-based design tool Flora raises $42M from Redpoint Ventures</a></strong></p><p>Flora, an AI-powered design platform, raises $42M Series A led by Redpoint Ventures to democratize creative workflows through multimodal generative AI and infinite canvas collaboration.</p><p><strong><a href="https://ampcode.com/news/deep-mode">Go Deep - Amp</a></strong></p><p>Amp launches &#8220;deep&#8221; mode powered by GPT-5.2-Codex, a highly autonomous coding agent that silently researches codebases for 5-15 minutes before making changes, complementing their interactive &#8220;smart&#8221; mode for different workflow needs.</p><p><strong><a 
href="https://blogs.nvidia.com/blog/nvidia-earth-2-open-models/">NVIDIA Launches Earth-2 Family of Open Models &#8212; the World&#8217;s First Fully Open, Accelerated Set of Models and Tools for AI Weather</a></strong></p><p>NVIDIA launches Earth-2 family of open weather AI models&#8212;the world&#8217;s first fully open, accelerated weather forecasting stack&#8212;offering models for 15-day global forecasts, local storm prediction, and data assimilation that run up to 500x faster than traditional physics-based methods.</p><p><strong><a href="https://allenai.org/blog/open-coding-agents">Open Coding Agents: Fast, accessible coding agents that adapt to any repo | Ai2</a></strong></p><p>Allen Institute for AI launches Open Coding Agents featuring SERA, an open-source coding agent, enabling repository-specific specialization where 32B models match 100B+ teachers on private codebases.</p><h2><strong>MLOps &amp; LLMOps</strong></h2><p><strong><a href="https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151">The AI Evolution of Graph Search at Netflix From Structured Queries to Natural Language</a></strong></p><p>A technical blog post detailing Netflix&#8217;s implementation of LLM-powered natural language search for their Graph Search platform, transforming structured GraphQL queries into intuitive text-based interfaces for enterprise data discovery.</p><h2><strong>Learning</strong></h2><p><strong><a href="https://research.google/blog/atlas-practical-scaling-laws-for-multilingual-models/">ATLAS: Practical scaling laws for multilingual models</a></strong></p><p>Google Research introduces ATLAS (Adaptive Transfer Scaling Laws), the largest public multilingual pre-training study with 774 training runs across 400+ languages.</p><p><strong><a href="https://www.arcee.ai/blog/trinity-large">Arcee AI | Trinity Large: An Open 400B Sparse MoE Model</a></strong></p><p>A technical deep-dive on Arcee AI&#8217;s Trinity Large, a 400B parameter sparse MoE 
model with 13B active parameters achieving frontier-class performance at 2-3x faster inference than peers, trained in 33 days for $20M total cost.</p><p><strong><a href="https://mitsloan.mit.edu/ideas-made-to-matter/ai-open-models-have-benefits-so-why-arent-they-more-widely-used">AI open models have benefits. So why aren&#8217;t they more widely used?</a></strong></p><p>A research article examining why open AI models, despite achieving 90% of closed-model performance at 87% lower cost, account for only 20% of usage while closed models dominate most of the market.</p><h2><strong>Libraries &amp; Code</strong></h2><p><strong><a href="https://github.com/comet-ml/opik">comet-ml/opik</a></strong></p><p>An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</p><p><strong><a href="https://github.com/aiming-lab/SimpleMem">aiming-lab/SimpleMem</a></strong></p><p>Efficient Lifelong Memory for LLM Agents</p><h2><strong>Papers &amp; Publications</strong></h2><p><strong><a href="https://arxiv.org/abs/2601.05242">GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</a></strong></p><p><strong>Abstract:</strong></p><p>As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. 
In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.</p><p><strong><a href="https://arxiv.org/abs/2601.08763">Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</a></strong></p><p><strong>Abstract:</strong></p><p>Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. 
Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.</p>]]></content:encoded></item></channel></rss>