Deep Learning Weekly: Issue 442
Claude Opus 4.6, Harness engineering: leveraging Codex in an agent-first world, a paper on Weak-Driven Learning: How Weak Agents make Strong Agents Stronger, and many more!
This week in deep learning, we bring you Claude Opus 4.6, Harness engineering: leveraging Codex in an agent-first world and a paper on Weak-Driven Learning: How Weak Agents make Strong Agents Stronger.
You may also enjoy GPT-5.3-Codex, Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations, a paper on SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Claude Opus 4.6 \ Anthropic
Anthropic launches Claude Opus 4.6 with state-of-the-art agentic coding, a 1M-token context window, and industry-leading scores on Terminal-Bench 2.0, Humanity’s Last Exam, and GDPval-AA.
Introducing GPT-5.3-Codex | OpenAI
OpenAI launches GPT-5.3-Codex, its first self-bootstrapped model that helped debug its own training, combining GPT-5.2’s reasoning with frontier coding performance while running 25% faster.
World model startup Runway closes $315M funding round
Runway closes a $315M Series E led by General Atlantic at a $5.3B valuation, with backing from NVIDIA and AMD, to advance its world models for 3D environment generation used in robotics simulation and video production.
OpenAI upgrades its Responses API to support agent skills and a complete terminal shell
An article about OpenAI adding server-side compaction, hosted shell containers, and the open “Skills” standard to its Responses API, enabling agents to handle 5M+ token sessions without context degradation.
MLOps/LLMOps
Millions at Stake: How Melange’s High-Recall Retrieval Prevents Litigation Collapse
A case study about how patent analytics company Melange uses Pinecone’s vector database to achieve 99% recall across 600M+ documents, saving $75K annually while preventing million-dollar litigation risks from missed prior art.
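For readers curious what a recall-first setup looks like in code, here is a minimal sketch of a deep-top_k query with the Pinecone Python client. The index name, namespace, embedding model, and metadata fields are illustrative assumptions, not Melange’s actual configuration.

```python
# Minimal sketch of a recall-oriented prior-art query with the Pinecone client.
# Index name, namespace, embedding model, and metadata fields are assumptions.
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("prior-art")                      # hypothetical index name
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # index must match this dimension (384)

query = "method for wireless charging of implanted medical devices"
vector = encoder.encode(query).tolist()

# Recall-first settings: pull a deep candidate list, filter and rerank downstream.
results = index.query(
    vector=vector,
    top_k=200,               # deep top_k trades latency for recall
    include_metadata=True,
    namespace="patents",     # hypothetical namespace
)

for match in results.matches[:10]:
    title = (match.metadata or {}).get("title", "untitled")
    print(f"{match.score:.3f}  {title}")
```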
Harness engineering: leveraging Codex in an agent-first world
An engineering post on how OpenAI built a million-line codebase with zero hand-written code: a three-engineer team drove Codex agents at 3.5 PRs per engineer per day, redefining the developer role as harness design rather than direct coding.
‘Observational memory’ cuts AI agent costs 10x and outscores RAG on long-context benchmarks
An article about Mastra’s open-source “observational memory” architecture that uses Observer and Reflector agents to compress conversation history into stable, cacheable context — scoring 94.87% on LongMemEval while cutting token costs 10x versus traditional RAG.
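The Observer/Reflector split is easy to sketch. The snippet below is a generic Python illustration of the pattern, not Mastra’s API (Mastra itself is a TypeScript framework); the class, the injected llm callable, and the reflection threshold are invented for this example.

```python
# Generic sketch of the Observer/Reflector pattern described in the article.
# Not Mastra's API: the class, the injected llm callable, and the threshold
# are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ObservationalMemory:
    llm: Callable[[str], str]          # any text-in/text-out model call
    observations: list[str] = field(default_factory=list)  # rolling Observer notes
    reflection: str = ""               # stable, cacheable Reflector summary
    reflect_every: int = 20

    def observe(self, new_messages: list[str]) -> None:
        # Observer: compress a batch of raw turns into short, durable notes.
        notes = self.llm(
            "Extract durable facts and decisions as bullet points:\n"
            + "\n".join(new_messages)
        )
        self.observations.append(notes)
        if len(self.observations) >= self.reflect_every:
            self._reflect()

    def _reflect(self) -> None:
        # Reflector: fold accumulated notes into one consolidated memory block.
        self.reflection = self.llm(
            "Merge into a concise, non-redundant memory:\n"
            + self.reflection + "\n" + "\n".join(self.observations)
        )
        self.observations.clear()   # raw notes are discarded once folded in

    def context(self) -> str:
        # Prepended to every request; because it changes rarely, it stays
        # prompt-cache friendly, which is where the cost savings come from.
        return self.reflection + "\n" + "\n".join(self.observations)
```

In use, memory.observe(...) is called as new turns arrive and memory.context() is prepended to each agent request in place of the raw conversation history.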
Learning
Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations
A blog post on DialogLab, Google’s open-source framework for designing and testing multi-party human-AI group conversations with configurable roles, turn-taking rules, and a human-in-the-loop control mode.
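As a rough picture of what configurable roles and turn-taking rules mean in practice, here is a toy simulation loop. It is not DialogLab’s API; the participant names, scripted responders, and round-robin policy are assumptions chosen for brevity.

```python
# Toy illustration of configurable roles and a turn-taking rule in a
# multi-party human-AI conversation. Not DialogLab's API; names and the
# round-robin policy are assumptions.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Participant:
    name: str
    role: str                              # e.g. "facilitator", "ai_assistant", "human"
    respond: Callable[[list[str]], str]    # maps conversation history to an utterance

def round_robin(participants, history):
    """Simple turn-taking rule: speakers rotate in a fixed order."""
    return participants[len(history) % len(participants)]

def simulate(participants, turn_policy, n_turns=6):
    history: list[str] = []
    for _ in range(n_turns):
        speaker = turn_policy(participants, history)
        utterance = speaker.respond(history)
        history.append(f"{speaker.name} ({speaker.role}): {utterance}")
    return history

if __name__ == "__main__":
    scripted = lambda lines: (lambda history: random.choice(lines))
    group = [
        Participant("Ava", "facilitator", scripted(["Let's hear ideas.", "Any objections?"])),
        Participant("Bot", "ai_assistant", scripted(["Summarizing so far...", "I suggest option B."])),
        Participant("Sam", "human", scripted(["I prefer option A.", "Sounds good."])),
    ]
    print("\n".join(simulate(group, round_robin)))
```

A real framework would swap round_robin for richer policies (moderator-driven, model-scored, or human-in-the-loop), which is the kind of configuration the post describes.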
What Is OpenClaw? Complete Guide to the Open-Source AI Agent
A guide to OpenClaw, the open-source, self-hosted AI agent that surpassed 175K GitHub stars in under two weeks and enables autonomous task execution through messaging apps like WhatsApp, Telegram, and Slack.
How AI assistance impacts the formation of coding skills \ Anthropic
A randomized controlled trial of 52 software engineers showing that AI coding assistance decreased skill mastery by 17%, with debugging abilities most affected, despite minimal productivity gains.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A2UI is an open-source project that allows agents to generate or populate rich user interfaces.
Papers & Publications
Weak-Driven Learning: How Weak Agents make Strong Agents Stronger
Abstract:
As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models’ own historical weak states. Motivated by this observation, we propose WMSS (Weak Agents Can Make Strong Agents Stronger), a post-training paradigm that leverages weak checkpoints to guide continued optimization. By identifying recoverable learning gaps via entropy dynamics and reinforcing them through compensatory learning, WMSS enables strong agents to improve beyond conventional post-training saturation. Experiments on mathematical reasoning and code generation datasets show that agents trained with our approach achieve effective performance improvements, while incurring zero additional inference cost.
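The abstract leaves the mechanics at a high level, but one plausible reading of “identifying recoverable learning gaps via entropy dynamics” is a token-level loss reweighted by the entropy gap between a weak checkpoint and the current model. The sketch below illustrates that reading only; the weighting scheme is an assumption, not the paper’s formulation.

```python
# Illustrative sketch only: reweight the training loss by the entropy gap
# between a weak (earlier) checkpoint and the current strong model. This is
# one plausible reading of the abstract, not the paper's exact method.
import torch
import torch.nn.functional as F

def compensatory_loss(strong_logits, weak_logits, targets, tau=1.0):
    """Cross-entropy reweighted by the weak-vs-strong entropy gap.

    strong_logits, weak_logits: (batch, seq, vocab); targets: (batch, seq).
    """
    ce = F.cross_entropy(
        strong_logits.transpose(1, 2), targets, reduction="none"
    )  # per-token loss, shape (batch, seq)

    def token_entropy(logits):
        p = F.softmax(logits, dim=-1)
        return -(p * torch.log(p.clamp_min(1e-9))).sum(dim=-1)

    with torch.no_grad():
        gap = token_entropy(weak_logits) - token_entropy(strong_logits)
        # Tokens the weak checkpoint was much more uncertain about get extra weight.
        weights = 1.0 + torch.relu(gap) / tau

    return (weights * ce).mean()
```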
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Abstract:
Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting high-level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library, SkillBank; an adaptive retrieval strategy for general and task-specific heuristics; and a recursive evolution mechanism that allows the skill library to co-evolve with the agent’s policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. Experimental results on ALFWorld, WebShop, and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance, outperforming strong baselines by over 15.3% and maintaining robustness as task complexity increases.
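At the interface level, the SkillBank idea can be pictured as a small library of distilled heuristics with adaptive retrieval. The toy sketch below is not the authors’ implementation; the data model and keyword-overlap scoring are assumptions made for illustration.

```python
# Toy sketch of a SkillBank-style hierarchical skill library with adaptive
# retrieval. The data model and keyword-overlap scoring are assumptions,
# not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    description: str         # distilled heuristic, not a raw trajectory
    task_type: str | None    # None marks a general-purpose skill

@dataclass
class SkillBank:
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

    def retrieve(self, query: str, task_type: str, k: int = 3) -> list[Skill]:
        # Adaptive retrieval: prefer task-specific skills, fall back to general ones.
        def score(skill: Skill) -> float:
            overlap = len(set(query.lower().split()) & set(skill.description.lower().split()))
            bonus = 1.0 if skill.task_type == task_type else (0.5 if skill.task_type is None else 0.0)
            return overlap + bonus
        return sorted(self.skills, key=score, reverse=True)[:k]

bank = SkillBank()
bank.add(Skill("check_inventory", "search the shop before comparing prices", "webshop"))
bank.add(Skill("decompose_goal", "break the task into verifiable sub-goals", None))
for s in bank.retrieve("buy the cheapest red mug", task_type="webshop"):
    print(s.name)
```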


