Deep Learning Weekly: Issue 457

DeepSWE, The Best AI Observability Tools for Agentic Systems in 2026, a paper on SkillOpt: Executive Strategy for Self-Evolving Agent Skills, and many more!

May 29, 2026

This week in deep learning, we bring you DeepSWE, The Best AI Observability Tools for Agentic Systems in 2026 and a paper on SkillOpt: Executive Strategy for Self-Evolving Agent Skills.

You may also enjoy Google reimagines search with AI agents and generative interfaces, Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models, a paper on Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity, and more!

As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.

Until next week!

Industry

DeepSWE

Datacurve introduces DeepSWE, a contamination-free coding benchmark of 113 from-scratch tasks across 91 repos and 5 languages, where GPT-5.5 leads at 70% and frontier models separate far more sharply than on SWE-Bench Pro.

Google reimagines search with AI agents and generative interfaces

Google overhauls Search at I/O 2026 with always-on Search Agents that monitor the web and report back, plus generative UI that builds interactive mini-apps on the fly via Antigravity and Gemini 3.5 Flash.

OpenRouter raises $113M to bring order to enterprise AI inference routing

OpenRouter raises $113M Series B led by CapitalG to scale its multi-model inference routing platform.

MLOps/LLMOps/AgentOps

The Best AI Observability Tools for Agentic Systems in 2026

A guide to the top AI observability tools in 2026 for agentic systems. Learn which platforms are best for tracing, evaluation, debugging, testing, and monitoring in production.

Learning

What Held Up at 3 AM: One Engineer’s RAG Case Study

Learn how WeaveCLI, a unified command-line tool for RAG over eleven vector databases, was built using Opik’s tracing capabilities and configurable retrieval pipelines.

Building self-improving tax agents with Codex

OpenAI and Thrive Holdings build Tax AI, a Codex-driven self-improving agent that turns repeated accountant corrections into bounded evals.

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

NVIDIA releases Nemotron-Labs Diffusion, an open model family (3B/8B/14B plus an 8B VLM) that combines autoregressive, diffusion, and self-speculation modes in one checkpoint.

The Agent Harness: Why the LLM Is the Smallest Part of Your Agent System

A technical article arguing that the LLM is the smallest part of a production agent system, with the real engineering living in a six-component harness and a deeper platform layer that determines production reliability.

An Engineer’s Guide to Better AI Skills: Implementing a Testing Process to Optimize Agent Performance in Any Repository or Skill

A practical guide from Pinterest Engineering on building a test harness to measure and improve how reliably coding agents invoke custom skills through frontmatter tuning and other techniques.

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

A glossary that pins down the agent vocabulary people keep using loosely — clarifying the scaffold-versus-harness distinction and grounding terms like context engineering, policy, skills, and more.

Libraries & Code

comet-ml/opik

An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

mksglu/context-mode

Context window optimization for AI coding agents.

Papers & Publications

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Abstract:

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
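
For readers who want a feel for the accept-only-on-strict-improvement loop the abstract describes, here is a minimal sketch. It is not the authors' implementation: the names `apply_edit`, `optimize_skill`, `toy_propose`, and `toy_validate` are hypothetical, the optimizer model is replaced by a hard-coded toy proposer, the held-out validation score is a keyword count, and the "textual learning-rate budget" is read here (as an assumption) as a cap on edits per step. The epoch-wise slow/meta update is left out.

```python
from typing import Callable, List, Tuple

# An edit is (operation, line index, text); text is ignored for "delete".
Edit = Tuple[str, int, str]


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a copy of the skill document with one edit applied."""
    op, index, text = edit
    lines = list(skill)
    if op == "add":
        lines.insert(index, text)
    elif op == "delete":
        del lines[index]
    elif op == "replace":
        lines[index] = text
    else:
        raise ValueError(f"unknown edit operation: {op}")
    return lines


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str], List[Edit]], List[Edit]],
    validate: Callable[[List[str]], float],
    steps: int,
    max_edits_per_step: int = 1,  # assumed stand-in for the edit budget
) -> Tuple[List[str], float, List[Edit]]:
    """Keep a candidate only if it strictly beats the current validation score."""
    best = list(skill)
    best_score = validate(best)
    rejected: List[Edit] = []  # buffer of edits that did not help
    for _ in range(steps):
        # Truncate the proposal to the budget, then apply edits in order
        # (each index refers to the document after the previous edit).
        edits = propose(best, rejected)[:max_edits_per_step]
        candidate = best
        for edit in edits:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        else:
            rejected.extend(edits)
    return best, best_score, rejected


# Toy stand-ins (hypothetical): a real setup would score rollouts on held-out
# tasks and ask a separate optimizer model for edits.
TARGETS = {"Run the tests.", "Check edge cases."}
CANDIDATE_LINES = ["Run the tests.", "Be verbose.", "Check edge cases."]


def toy_validate(skill: List[str]) -> float:
    """Score a skill by how many target instructions it contains."""
    return float(len(TARGETS & set(skill)))


def toy_propose(skill: List[str], rejected: List[Edit]) -> List[Edit]:
    """Propose appending the first candidate line not yet present or rejected."""
    tried = {text for _, _, text in rejected}
    for line in CANDIDATE_LINES:
        if line not in skill and line not in tried:
            return [("add", len(skill), line)]
    return []


final, final_score, rejected_edits = optimize_skill(
    ["Read the task."], toy_propose, toy_validate, steps=5
)
print(final)
print(final_score)
```

In this toy run the line "Be verbose." is proposed once, fails to raise the score, and lands in the rejected buffer so it is not proposed again; the two target lines are accepted because each strictly improves the score.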

Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity

Abstract:

LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task’s solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact, or exploit, them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command “returns the complete solution to this task” in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. Our findings identify configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.

A guest post by

Miko Planas
