Deep Learning Weekly: Issue 461
Advanced Claude Code Cost Tracking: How to Save 30% on Token Spend, Introducing Claude Tag, a paper on Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs, and many more!
This week in deep learning, we bring you Advanced Claude Code Cost Tracking: How to Save 30% on Token Spend, Introducing Claude Tag, and a paper on Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs.
You may also enjoy CoT Monitoring: Where Does a Hot Safety Problem Come From?, A Guide to Inference Engineering, a paper on Are We Ready For An Agent-Native Memory System?, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Xiaomi’s HarnessX rewrites its own AI scaffolding mid-task — and smaller models gain the most
Xiaomi researchers introduce HarnessX, a framework that treats an AI agent’s harness as a composable, self-evolving object, letting it autonomously rewrite its own scaffolding mid-task rather than relying on manual updates.
Introducing Claude Tag
Anthropic launches Claude Tag, a Slack-native @Claude teammate that learns channel context, works asynchronously, and now powers 65% of the code written by Anthropic's own product team.
Introducing computer use in Gemini 3.5 Flash
Google makes computer use a native built-in tool in Gemini 3.5 Flash, consolidating its previously standalone screen-control model into the main Flash model with new enterprise prompt-injection safeguards.
Qwen-AgentWorld: Language World Models for General Agents
Qwen releases Qwen-AgentWorld, a native language world model trained to simulate seven agent environments (MCP, Search, Terminal, SWE, Web, OS, Android) within one model, outperforming GPT-5.4 and Claude Opus 4.8 on the new AgentWorldBench evaluation.
Mistral OCR 4: SOTA OCR for Document Intelligence
Mistral releases OCR 4, a self-hostable document-intelligence model adding bounding boxes, block classification, and confidence scores, beating competitors in human preference testing and topping OlmOCRBench.
MLOps/LLMOps/AgentOps
Advanced Claude Code Cost Tracking: How to Save 30% on Token Spend
Comet introduces Cost Intelligence, a Claude Code and Codex cost tracker that attributes AI spend across developers, projects, tools, and workflows while surfacing configuration changes that reduce token usage.
Version-Controlling Your Agents: Deployment, Rollback, and Safe Promotion Patterns
A practical guide arguing that AI agents need the same versioning, staged promotion, and rollback discipline as traditional software, treating their configs as immutable, deployable “Agent-as-Code” artifacts.
A guide to loop engineering, an agent design pattern that replaces prompt-centric workflows with iterative execution loops that manage tool use, state, planning, and reflection.
A Guide to AI Inference Engineering
A guide explaining LLM inference engineering through the prefill/decode split, breaking down six core optimization techniques and when self-hosting beats off-the-shelf APIs.
Learning
How Evaluation-Driven Development (EDD) Works
Learn how retrieval, memory, and context management have emerged as critical infrastructure for AI agents, often having a greater impact on system performance than the underlying model itself.
CoT Monitoring: Where Does a Hot Safety Problem Come From?
A reflective essay tracing the intellectual lineage of chain-of-thought (CoT) monitoring, arguing it emerged from the convergence of ML monitoring practices and CoT-as-explainability research rather than as a standalone idea.
Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix
A blog post about Netflix Research introducing Vera, a layered video diffusion model, and VOID, a physics-aware inpainting model, both aimed at giving artists more precise, controllable AI-assisted video editing.
Healthcare Benchmarks Are Only as Good as Their Assumptions
A research blog post arguing that healthcare LLM benchmarks fail to predict real-world performance because they embed unstated assumptions about task structure and outcome measurement that break down at deployment.
Patterns for Building Cybersecurity Evals
A guide breaking down cybersecurity evals into four shared primitives (sandboxed target, difficulty-tuning inputs, tools, grader), then walking through seven benchmarks from CTF-style exploitation to 50-host network compromise.
Libraries & Code
An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
MiMoCode is a terminal-native AI coding assistant.
Papers & Publications
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
Abstract:
While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model’s parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.
Are We Ready For An Agent-Native Memory System?
Abstract:
Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems.


