Deep Learning Weekly: Issue 443
Optimizing AI IDEs at Scale, What do “economic value” benchmarks tell us?, a paper on MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents, and many more!
This week in deep learning, we bring you Optimizing AI IDEs at Scale, What do “economic value” benchmarks tell us? and a paper on MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents.
You may also enjoy Gemini 3 Deep Think: Advancing science, research and engineering, OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments, a paper on Thought Communication in Multiagent Collaboration, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Claude Sonnet 4.6
Anthropic launches Claude Sonnet 4.6 as the new default model across all plans, featuring a 1M-token context window, major computer-use improvements, and Opus-level performance on many tasks at the same $3/$15 per-million-token price as Sonnet 4.5.
Gemini 3 Deep Think: Advancing science, research and engineering
Google announces a major upgrade to Gemini 3 Deep Think, its specialized reasoning mode targeting frontier science, math, and engineering — setting new benchmark records and opening early API access to researchers and enterprises.
Alibaba unveils Qwen3.5 as China’s chatbot race shifts to AI agents
Alibaba launches Qwen3.5 — a 397B-parameter, natively multimodal open-weight model built for agentic AI — as China’s frontier model race intensifies ahead of an expected DeepSeek release.
AI agent reliability startup Temporal raises $300M in funding
Temporal raises $300M Series D at a $5B valuation, led by a16z, to scale its open-source platform that makes AI agents fault-tolerant by logging every action and enabling automatic recovery from failures.
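The durable-execution idea behind Temporal can be sketched as event sourcing: each completed step's result is journaled, so a restarted workflow replays finished steps from the log instead of re-executing them. This is a toy illustration of the concept, not Temporal's actual SDK; all names here are invented:

```python
# Toy sketch of durable execution: completed steps are journaled,
# so a crashed workflow can resume without redoing finished work.
journal = {}  # step name -> recorded result (persisted to storage in practice)

def durable_step(name, fn):
    """Run fn once; on replay, return the journaled result instead."""
    if name in journal:
        return journal[name]   # replay: skip re-execution
    result = fn()
    journal[name] = result     # record the result before moving on
    return result

def workflow():
    a = durable_step("fetch", lambda: 21)
    b = durable_step("double", lambda: a * 2)
    return b

first = workflow()    # executes both steps and journals them
resumed = workflow()  # simulated crash + restart: replays from the journal
print(first, resumed)  # 42 42
```

In the real system the journal lives in a durable event history, which is what makes recovery automatic after a process failure.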
MLOps/LLMOps
Optimizing AI IDEs at Scale
A blog post detailing how Comet’s engineering team traced rising AI IDE spend to bloated context windows and always-on agent rules, then reduced token overhead by shrinking default context, modularizing skills, and tightening evaluation loops.
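The core accounting behind that optimization can be sketched in a few lines: always-on rules pay their token cost on every request, while modular skills only pay when loaded. The token counts below are invented for illustration, not figures from the post:

```python
# Rough sketch of per-request context cost: always-on rules are charged on
# every request, on-demand skills only when actually loaded.
def request_tokens(base_prompt, always_on_rules, loaded_skills):
    return base_prompt + sum(always_on_rules) + sum(loaded_skills)

rules = [1200, 800, 600]           # always-on rule files (tokens, illustrative)
skills = {"git": 900, "sql": 700}  # on-demand skill modules (tokens, illustrative)

# Before: every rule and skill injected into every request
before = request_tokens(500, rules + list(skills.values()), [])
# After: rules trimmed to one, skills loaded only when relevant (here: "git")
after = request_tokens(500, [600], [skills["git"]])
print(before, after)  # 4700 2000
```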
Scaling LLM Post-Training at Netflix
A technical blog post about how Netflix built an internal LLM post-training framework using Ray-based distributed orchestration to scale fine-tuning and RL workflows across multi-node GPU clusters for recommendation, search, and personalization.
Learning
OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
A technical blog post about OpenEnv, an open-source agent evaluation framework, and findings from testing tool-using agents in a production-grade calendar benchmark — revealing that ambiguity and multi-step chaining, not tool selection, are the primary failure modes.
Two different tricks for fast LLM inference
A technical blog post comparing Anthropic’s and OpenAI’s “fast mode” inference approaches — low-batch-size serving vs. Cerebras wafer-scale chips — and arguing that accuracy, not raw speed, remains the dominant factor in agentic AI value.
We Extracted OpenClaw’s Memory System and Open-Sourced It (memsearch)
A technical blog post about how Zilliz extracted OpenClaw’s transparent, Markdown-based long-term memory architecture and open-sourced it as memsearch — a standalone, framework-agnostic memory library backed by Milvus vector search.
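The appeal of a transparent, file-based memory like this is that it is just Markdown you can read and edit. A minimal sketch of the pattern, using naive keyword overlap as a stand-in for the Milvus vector search that memsearch actually uses (everything below is illustrative, not memsearch's API):

```python
# Memories stored as plain Markdown bullets; retrieval by keyword overlap
# here stands in for real embedding-based vector search.
MEMORY_MD = """\
- User prefers concise answers
- Project deadline is Friday
- User's favorite language is Rust
"""

def load_memories(md):
    # Each "- " bullet line is one memory entry.
    return [line[2:] for line in md.splitlines() if line.startswith("- ")]

def retrieve(query, memories, k=1):
    def overlap(m):
        return len(set(query.lower().split()) & set(m.lower().split()))
    return sorted(memories, key=overlap, reverse=True)[:k]

mems = load_memories(MEMORY_MD)
print(retrieve("favorite programming language", mems))
```

The design point is inspectability: the memory store stays a human-readable file, and only retrieval needs specialized infrastructure.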
What do “economic value” benchmarks tell us? | Epoch AI
A research report analyzing three “economic value” benchmarks that measure AI performance on real-world digital work tasks, concluding that high scores signal meaningful task-level acceleration but fall short of implying end-to-end job automation.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
json-render is a Generative UI framework: AI generates interfaces from natural language prompts, constrained to components you define.
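The constraint idea is the interesting part: the model emits a JSON tree, but the renderer will only instantiate components from an allowlist the developer defines, so the model cannot inject arbitrary markup. A toy sketch of that pattern (not json-render's actual API; the registry and node shape here are invented):

```python
# Illustrative schema-constrained UI rendering: the model's JSON output may
# only reference components the developer has registered.
REGISTRY = {
    "heading": lambda props: f"<h1>{props['text']}</h1>",
    "button":  lambda props: f"<button>{props['label']}</button>",
}

def render(node):
    kind = node["type"]
    if kind not in REGISTRY:  # unknown component types are rejected, not executed
        raise ValueError(f"unknown component: {kind}")
    html = REGISTRY[kind](node.get("props", {}))
    children = "".join(render(c) for c in node.get("children", []))
    return html + children

tree = {"type": "heading", "props": {"text": "Hi"},
        "children": [{"type": "button", "props": {"label": "OK"}}]}
print(render(tree))  # <h1>Hi</h1><button>OK</button>
```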
Papers & Publications
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Abstract:
Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills: structured, reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.
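The controller/executor/designer loop described in the abstract can be sketched in toy form. This is an invented illustration of the control flow only, not the paper's implementation: the skill names, the selection heuristic, and the designer's refinement rule are all made up:

```python
# Toy MemSkill-style loop: a controller picks relevant skills, an executor
# applies them to produce memories, and a designer patches the skill set
# when results look incomplete. All skills and heuristics are illustrative.
skills = {
    "extract_dates": lambda trace: [t for t in trace if "2025" in t],
    "extract_names": lambda trace: [t for t in trace if t.istitle()],
}

def controller(trace, k=1):
    # Stand-in for a learned policy: keep the k skills with non-empty output.
    scored = [(name, len(fn(trace))) for name, fn in skills.items()]
    scored.sort(key=lambda x: -x[1])
    return [name for name, score in scored[:k] if score > 0]

def executor(trace, selected):
    memory = []
    for name in selected:
        memory.extend(skills[name](trace))
    return memory

def designer(trace, memory):
    # Toy "evolution": if memory came back empty, propose a catch-all skill.
    if not memory:
        skills["keep_all"] = lambda t: list(t)

trace = ["Met Alice", "budget due 2025-03-01", "lunch"]
mem = executor(trace, controller(trace))
designer(trace, mem)
print(mem)  # ['budget due 2025-03-01']
```

In the paper, both the selection policy and the skill set itself are learned and revised from hard cases; here those are stubbed with fixed heuristics to show the loop's shape.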
Thought Communication in Multiagent Collaboration
Abstract:
Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.
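The generative view in the abstract (agent states produced by an unknown function of shared and private latent thoughts) can be illustrated with a deliberately tiny example. Here the mixing function is known and trivially invertible, which sidesteps the identifiability theory that is the paper's actual contribution; this is a toy picture of the setup, not the method:

```python
# Toy generative model: each agent's observable state is an invertible
# mixing of a shared thought and a private thought. With the mixing known,
# the shared thought is recoverable from any single agent's state.
def agent_state(shared, private):
    # x = (shared + private, shared - private)
    return (shared + private, shared - private)

def recover(state):
    a, b = state
    shared = (a + b) / 2   # invert the mixing
    private = (a - b) / 2
    return shared, private

x1 = agent_state(shared=3.0, private=1.0)   # agent 1's observable state
x2 = agent_state(shared=3.0, private=-2.0)  # agent 2's observable state
s1, _ = recover(x1)
s2, _ = recover(x2)
print(s1 == s2)  # True: both agents' recovered shared thought agrees
```

The paper's result is that shared and private latents remain identifiable even when the mixing function is unknown and nonparametric, which is what makes "thought communication" possible without a hand-specified decoder.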


