Deep Learning Weekly: Issue 459
Claude Fable 5, Cohere’s North Mini Code, a paper on Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering, and many more!
This week in deep learning, we bring you Claude Fable 5, North Mini Code: Cohere’s First Model For Developers and a paper on Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering.
You may also enjoy NVIDIA Nemotron 3 Ultra, Controlling the capital after AGI, a paper on Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Claude Fable 5 and Claude Mythos 5 \ Anthropic
Anthropic launches Claude Fable 5, a general-access Mythos-class model at $10/$50 per M tokens, with classifier-based fallbacks to Opus 4.8 for cyber, bio, and distillation queries.
Introducing Gemma 4 12B: a unified, encoder-free multimodal model
Google releases Gemma 4 12B, an encoder-free multimodal model that runs on 16GB VRAM and processes vision and audio natively through the LLM backbone.
NVIDIA Nemotron 3 Ultra
NVIDIA releases Nemotron 3 Ultra, a 550B-total / 55B-active MoE hybrid Mamba-Transformer pretrained in NVFP4, with up to 5.9x throughput over competing open MoEs and a 1M-token context.
Confidential submission of draft S-1 to the SEC | OpenAI
OpenAI confidentially submits a draft S-1 to the SEC and announces it publicly ahead of an expected leak, noting that IPO timing remains undecided because some strategic moves are easier as a private company.
MLOps/LLMOps/AgentOps
How Google SRE is using agentic AI to improve operations
An article about how Google SRE is wiring agentic AI across the full incident lifecycle — dynamic anomaly detection, autonomous investigation, and a RAG layer over historical incidents to inform mitigation agents.
Unlocking dependable responses with Gemini Enterprise Agent Platform’s Agentic RAG
Google launches an agentic RAG framework on Gemini Enterprise Agent Platform with a Sufficient Context Agent that iterates until retrieval gaps are filled, hitting 90.1% accuracy on multi-hop queries — up to 34% over vanilla RAG.
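For readers who want a feel for the "iterate until retrieval gaps are filled" idea, here is a minimal toy sketch in Python. It is not Google's implementation or API: the corpus, the keyword-based sufficiency check, and the names retrieve, missing_facts, and agentic_retrieve are all hypothetical stand-ins for what would be a real retriever and an LLM-based judge.

```python
# Toy corpus; every name in this sketch is hypothetical, not a Google API.
CORPUS = {
    "capital of france": "Paris is the capital of France.",
    "population of paris": "Paris has about 2.1 million residents.",
}


def retrieve(query):
    """Toy keyword lookup: return passages whose key appears in the query."""
    q = query.lower()
    return [text for key, text in CORPUS.items() if key in q]


def missing_facts(required, context):
    """Toy sufficiency check: required keywords not yet present in the context."""
    joined = " ".join(context).lower()
    return [kw for kw in required if kw not in joined]


def agentic_retrieve(question, required, followups, max_rounds=3):
    """Retrieve, check for gaps, issue follow-up queries, and repeat."""
    context = list(retrieve(question))
    for _ in range(max_rounds):
        gaps = missing_facts(required, context)
        if not gaps:
            break  # context is judged sufficient
        for gap in gaps:
            # Fall back to the gap keyword itself if no follow-up is defined.
            for passage in retrieve(followups.get(gap, gap)):
                if passage not in context:
                    context.append(passage)
    return context, missing_facts(required, context)


context, gaps = agentic_retrieve(
    "What is the population of the capital of France?",
    required=["capital", "million"],
    followups={"million": "population of Paris"},
)
```

In this example the first retrieval only finds the passage about the capital, the check flags "million" as missing, and a follow-up query fills the gap on the next round. A single-pass retriever would have stopped after the first step.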
Learning
North Mini Code: Cohere’s First Model For Developers
An article about how Cohere trained North Mini Code’s agentic coding capabilities — using cascaded SFT as an RLVR primer, joint multi-environment RL across 70k containerized repos, and cross-harness data mixing to generalize across SWE-Agent, mini-SWE-agent, and OpenCode scaffolds.
Testing Gemini models for scheming tendencies | by DeepMind Safety Research
DeepMind releases two scheming eval frameworks for Gemini — Gram (simulated agentic environments) and honeypots (real safety codebases) — finding 2–3% unprompted sabotage rates and no coherent misalignment.
Claude Fable 5 and new AI safety fables
Nathan Lambert argues that Anthropic’s undisclosed safety filters in Claude Fable 5 — which silently degrade responses for frontier AI research without notifying users — are competitive entrenchment dressed as safety policy.
Controlling the capital after AGI
An analytical piece from Epoch AI taxonomizing post-AGI wealth redistribution proposals — UBI, UBS, UBC, and sovereign wealth funds — along a single axis: how much control over capital, not just income, each scheme grants citizens.
Libraries & Code
An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Papers & Publications
Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering
Abstract:
LLM-based Multi-Agent (LLM-MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, hindering practical adoption due to unpredictable costs and environmental impact. To address this, we conduct an analysis of token consumption patterns in an LLM-MA system within the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare token distribution (input, output, reasoning) across these stages.
Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption for an average of 59.4% of tokens. Furthermore, we observe that input tokens consistently constitute the largest share of consumption for an average of 53.9%, providing empirical evidence for potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our novel methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token-efficient agent collaboration protocols.
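To make the kind of accounting described in the abstract concrete, here is a small Python sketch that computes per-stage and per-type token shares from trace records. The record layout, the function names, and all numbers are made up for illustration. They are not the paper's data or tooling.

```python
from collections import defaultdict

# Hypothetical trace records: (stage, input, output, reasoning) token counts.
# These numbers are invented for illustration, not taken from the paper.
TRACE = [
    ("Design", 1200, 300, 500),
    ("Coding", 3000, 1500, 1000),
    ("Code Review", 9000, 2500, 3500),
    ("Testing", 1500, 400, 600),
]


def stage_shares(trace):
    """Fraction of all tokens consumed by each development stage."""
    totals = defaultdict(int)
    for stage, inp, out, reasoning in trace:
        totals[stage] += inp + out + reasoning
    grand = sum(totals.values())
    return {stage: count / grand for stage, count in totals.items()}


def type_shares(trace):
    """Fraction of all tokens that are input, output, or reasoning tokens."""
    inp = sum(record[1] for record in trace)
    out = sum(record[2] for record in trace)
    reasoning = sum(record[3] for record in trace)
    grand = inp + out + reasoning
    return {
        "input": inp / grand,
        "output": out / grand,
        "reasoning": reasoning / grand,
    }


by_stage = stage_shares(TRACE)
by_type = type_shares(TRACE)
```

With these invented counts, Code Review takes 60% of all tokens and input tokens take 58.8%, which is the same shape of result the abstract reports (59.4% and 53.9% on the paper's real traces).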
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
Abstract:
Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent’s ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.


