Deep Learning Weekly: Issue 453
OpenAI models come to AWS, Hidden Technical Debt of AI Systems: Agent Runtime, a paper on Recursive Multi-Agent Systems, and many more!
This week in deep learning, we bring you OpenAI models, Codex, and Managed Agents coming to AWS, Hidden Technical Debt of AI Systems: Agent Runtime, and a paper on Recursive Multi-Agent Systems.
You may also enjoy Mistral AI’s Workflows, Four ways Google Research scientists have been using Empirical Research Assistance, a paper on From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI models, Codex, and Managed Agents come to AWS
OpenAI brings GPT-5.5, Codex, and Managed Agents to Amazon Bedrock in limited preview, a day after ending Microsoft’s exclusive cloud license.
Mistral AI launches Workflows in public preview, a Temporal-powered durable execution engine that lets enterprises run human-in-the-loop AI processes with data staying inside their own infrastructure.
Ineffable Intelligence raises $1.1B at $5.1B valuation to build an AI ‘superlearner’
AlphaGo creator David Silver’s new British startup Ineffable Intelligence raises $1.1B to build a “superlearner” AI that generates entirely new knowledge via RL without pretraining.
Poolside releases two agentic coding models — the open-weight Laguna XS.2 and proprietary Laguna M.1 — both trained from scratch and free to use temporarily via API, alongside a terminal coding agent and web IDE.
MLOps/LLMOps/AgentOps
Hidden Technical Debt of AI Systems: Agent Runtime
A technical blog post arguing that the agent runtime — the sandboxed execution environment wrapping the model — is the emerging hidden technical debt of AI systems.
Context decay, orchestration drift, and the rise of silent failures in AI systems
A practical guide about four enterprise AI failure patterns — context degradation, orchestration drift, silent partial failure, and automation blast radius — that standard infrastructure monitoring cannot detect, and what teams must add to catch them.
Monitoring LLM behavior: Drift, retries, and refusal patterns
A practical guide on instrumenting LLM applications with two complementary evaluation pipelines — offline regression testing and online behavioral telemetry — to detect model drift before and after deployment.
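The online half of that setup can be very small. Below is a hypothetical sketch of behavioral telemetry (not taken from the guide): each response is tagged as a refusal or not using illustrative marker strings, and a sliding window tracks the refusal rate so a sudden shift can trigger an alert.

```python
from collections import deque

# Hypothetical sketch of online behavioral telemetry: track refusals and
# retry counts over a sliding window so behavioral drift can raise an alert.
REFUSAL_MARKERS = ("i can't help", "i cannot assist")  # illustrative only

class BehaviorMonitor:
    def __init__(self, window=100, refusal_alert=0.2):
        self.events = deque(maxlen=window)   # (refused, retries) per call
        self.refusal_alert = refusal_alert

    def record(self, response_text, retries=0):
        refused = response_text.strip().lower().startswith(REFUSAL_MARKERS)
        self.events.append((refused, retries))

    def refusal_rate(self):
        if not self.events:
            return 0.0
        return sum(r for r, _ in self.events) / len(self.events)

    def should_alert(self):
        return self.refusal_rate() > self.refusal_alert

monitor = BehaviorMonitor(window=4)
for text in ["Sure, here is the answer.", "I can't help with that.",
             "I cannot assist with that request.", "Done."]:
    monitor.record(text)
print(monitor.refusal_rate())  # 0.5
```

In practice the refusal classifier would be a small model or a labeled judge rather than string matching, but the windowed-rate pattern stays the same.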
Learning
Four ways Google Research scientists have been using Empirical Research Assistance
A Google Research blog post on four real-world applications of their Empirical Research Assistance (ERA) tool — an LLM-backed system for generating expert-level scientific software — spanning epidemiology, cosmology, climate monitoring, and neuroscience.
AI Organizations Can Be More Effective but Less Aligned than Individual Agents
An Anthropic research paper finding that teams of individually aligned AI agents can still collectively produce less ethical — but more effective — solutions than a single agent, suggesting that AI safety research needs to move beyond studying agents in isolation.
Introducing ARFBench: A time series question-answering benchmark based on real incidents
CMU and Datadog introduce ARFBench, a 750-question time series QA benchmark derived from real production incidents, where the best model (GPT-5 at 62.7% accuracy) still trails domain experts by ~9 points but a hybrid TSFM-VLM oracle reaches 87.2%.
Decoupled DiLoCo: Resilient, Distributed AI Training at Scale
Google DeepMind releases Decoupled DiLoCo, a fault-tolerant distributed training architecture that achieves 88% goodput vs. 27% for standard data-parallel at scale, using ~240x less inter-datacenter bandwidth with no measurable ML performance loss.
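The bandwidth saving comes from the DiLoCo family's two-level structure: replicas take many local optimizer steps and communicate only a parameter delta once per round, which an outer optimizer applies with momentum. The toy sketch below illustrates that loop on a scalar objective; it is not DeepMind's implementation, and all hyperparameters are illustrative.

```python
import random

# Toy sketch of a DiLoCo-style outer loop (not DeepMind's code): each replica
# takes H local SGD steps from the shared parameters, then replicas exchange
# only their parameter deltas, and an outer optimizer applies the averaged
# delta with momentum. Communication happens once per H steps, not every step.
random.seed(0)

def local_steps(theta, H=5, lr=0.1):
    # Inner loop: noisy gradient descent toward 0 on f(x) = x**2 / 2.
    for _ in range(H):
        grad = theta + random.gauss(0, 0.01)
        theta -= lr * grad
    return theta

theta = 1.0        # shared parameters (a scalar, for illustration)
momentum = 0.0
replicas = 4
for outer_round in range(10):
    deltas = [theta - local_steps(theta) for _ in range(replicas)]
    avg_delta = sum(deltas) / replicas      # the one all-reduce per round
    momentum = 0.5 * momentum + avg_delta   # outer momentum
    theta -= 0.5 * momentum                 # outer learning rate

print(round(theta, 3))
```

Because only deltas cross the slow link once per round, inter-datacenter traffic drops by roughly the number of local steps, which is where the large bandwidth reductions in such schemes come from.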
How to correctly use MCP servers with your AI Agents
A practical guide on avoiding context bloat from MCP servers by using two patterns — user-triggered @mention injection for ad-hoc tool loading, and scoped subagent declarations.
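The @mention pattern can be reduced to a gate in front of the prompt builder. The sketch below is a hypothetical illustration (server and tool names are invented, and real MCP schemas are richer): tool schemas are injected only for servers the user explicitly @mentions, so unmentioned servers contribute zero context tokens.

```python
# Hypothetical sketch of @mention-gated tool loading: instead of injecting
# every MCP server's full tool schema into each prompt, inject schemas only
# for servers the user explicitly @mentions. All names are illustrative.
TOOL_SCHEMAS = {
    "github": [{"name": "create_issue", "description": "Open a GitHub issue"}],
    "postgres": [{"name": "run_query", "description": "Run a read-only SQL query"}],
    "slack": [{"name": "post_message", "description": "Post to a channel"}],
}

def tools_for_message(message: str) -> list:
    """Return only the tool schemas for servers @mentioned in the message."""
    active = [s for s in TOOL_SCHEMAS if f"@{s}" in message]
    return [tool for server in active for tool in TOOL_SCHEMAS[server]]

tools = tools_for_message("Use @github to file a bug about the login page")
print([t["name"] for t in tools])  # ['create_issue']
```

The scoped-subagent variant moves the same gate one level up: a subagent is declared with a fixed server list, so its context only ever contains that subset.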
DeepSeek V4 Compressed Attention: How the KV-Cache Shrinks to Just 2%
A technical explainer on how DeepSeek V4 achieves 1M-token context windows by compressing the KV cache to just 2% of standard size — combining coarse and fine-grained sequence-dimension compression across a hybrid layer stack.
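Why a 2% KV cache matters is easiest to see with back-of-envelope arithmetic. The config numbers below are illustrative assumptions, not DeepSeek's actual architecture: per-token KV memory is 2 (K and V) x layers x KV heads x head dim x bytes per value, and compression scales it linearly.

```python
# Back-of-envelope KV-cache arithmetic. The model dimensions here are
# illustrative assumptions, NOT DeepSeek V4's actual configuration.
def kv_cache_gib(tokens, layers=60, kv_heads=8, head_dim=128,
                 bytes_per_value=2, compression=1.0):
    # 2 accounts for storing both the K and the V tensor per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes * compression / 2**30

full = kv_cache_gib(1_000_000)                      # uncompressed 1M-token cache
compressed = kv_cache_gib(1_000_000, compression=0.02)
print(f"{full:.1f} GiB -> {compressed:.2f} GiB")
```

Under these assumed dimensions, a 1M-token cache shrinks from hundreds of GiB to a few GiB, which is what makes long contexts fit on ordinary inference hardware.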
Libraries & Code
An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
1K resolution vision transformers pretrained on 1B human images.
Papers & Publications
Abstract:
Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents into a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thought generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2×-2.4× end-to-end inference speedup, and 34.6%-75.6% token usage reduction.
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
Abstract:
Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce OneManCompany (OMC), a framework that elevates multi-agent systems to the organisational level. OMC encapsulates skills, tools, and runtime configurations into portable agent identities called Talents, orchestrated through typed organisational interfaces that abstract over heterogeneous backends. A community-driven Talent Market enables on-demand recruitment, allowing the organisation to close capability gaps and reconfigure itself dynamically during execution. Organisational decision-making is operationalised through an Explore-Execute-Review (E2R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop: tasks are decomposed top-down into accountable units and execution outcomes are aggregated bottom-up to drive systematic review and refinement. This loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi-agent systems from static, pre-configured pipelines into self-organising and self-improving AI organisations capable of adapting to open-ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an 84.67% success rate, surpassing the state of the art by 15.48 percentage points, with cross-domain case studies further demonstrating its generality.