Deep Learning Weekly: Issue 459

Claude Fable 5, Cohere’s North Mini Code, a paper on Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering, and many more!

Jun 12, 2026

This week in deep learning, we bring you Claude Fable 5, North Mini Code: Cohere’s First Model For Developers, and a paper on Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering.

You may also enjoy NVIDIA Nemotron 3 Ultra, Controlling the capital after AGI, a paper on Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?, and more!

As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.

Until next week!

Industry

Claude Fable 5 and Claude Mythos 5 \ Anthropic

Anthropic launches Claude Fable 5, a general-access Mythos-class model at $10/$50 per M tokens, with classifier-based fallbacks to Opus 4.8 for cyber, bio, and distillation queries.

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Google releases Gemma 4 12B, an encoder-free multimodal model that runs on 16GB VRAM and processes vision and audio natively through the LLM backbone.

NVIDIA Nemotron 3 Ultra

NVIDIA releases Nemotron 3 Ultra — 550B total / 55B active MoE hybrid Mamba-Transformer, pretrained in NVFP4, with up to 5.9x throughput over competing open MoEs and 1M token context.

Confidential submission of draft S-1 to the SEC | OpenAI

OpenAI files a confidential S-1 with the SEC, preemptively announcing it publicly ahead of an expected leak — while noting IPO timing remains undecided as some strategic moves are easier as a private company.

MLOps/LLMOps/AgentOps

How Google SRE is using agentic AI to improve operations

An article about how Google SRE is wiring agentic AI across the full incident lifecycle — dynamic anomaly detection, autonomous investigation, and a RAG layer over historical incidents to inform mitigation agents.

Unlocking dependable responses with Gemini Enterprise Agent Platform’s Agentic RAG

Google launches an agentic RAG framework on Gemini Enterprise Agent Platform with a Sufficient Context Agent that iterates until retrieval gaps are filled, hitting 90.1% accuracy on multi-hop queries — up to 34% over vanilla RAG.
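
For readers who want a feel for the "iterate until retrieval gaps are filled" idea, here is a minimal sketch of such a loop. It is not Google's API or implementation: `agentic_retrieve`, `retrieve`, `find_gap`, and the toy corpus are all hypothetical names made up for illustration, and a real system would back `find_gap` with a model rather than keyword checks.

```python
from typing import Callable, List, Optional

def agentic_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    find_gap: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = 4,
) -> List[str]:
    """Retrieve repeatedly until find_gap reports no missing information.

    Hypothetical sketch: find_gap returns a follow-up query when the
    gathered context is insufficient, or None when it is sufficient.
    """
    context: List[str] = []
    query = question
    for _ in range(max_rounds):
        for passage in retrieve(query):
            if passage not in context:  # avoid duplicate passages
                context.append(passage)
        gap = find_gap(question, context)
        if gap is None:  # context judged sufficient, stop iterating
            break
        query = gap  # otherwise search for the missing piece next
    return context

# Toy stand-ins (assumptions for illustration only).
CORPUS = {
    "widget maker": "Widgets are made by Acme.",
    "Acme founder": "Acme was founded by Jo.",
}

def toy_retrieve(query: str) -> List[str]:
    # Exact-match lookup; a real retriever would do semantic search.
    return [CORPUS[query]] if query in CORPUS else []

def toy_find_gap(question: str, context: List[str]) -> Optional[str]:
    # Keyword checks stand in for a model judging context sufficiency.
    if not any("Acme" in p for p in context):
        return "widget maker"
    if not any("founded" in p for p in context):
        return "Acme founder"
    return None

result = agentic_retrieve("Who founded the widget maker?", toy_retrieve, toy_find_gap)
print(result)
```

The multi-hop question needs two lookups here, and the loop stops as soon as the sufficiency check returns `None` or the round budget runs out.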

Learning

North Mini Code: Cohere’s First Model For Developers

An article about how Cohere trained North Mini Code’s agentic coding capabilities — using cascaded SFT as an RLVR primer, joint multi-environment RL across 70k containerized repos, and cross-harness data mixing to generalize across SWE-Agent, mini-SWE-agent, and OpenCode scaffolds.

Testing Gemini models for scheming tendencies | by DeepMind Safety Research

DeepMind releases two scheming eval frameworks for Gemini — Gram (simulated agentic environments) and honeypots (real safety codebases) — finding 2–3% unprompted sabotage rates and no coherent misalignment.

Claude Fable 5 and new AI safety fables

Nathan Lambert argues that Anthropic’s undisclosed safety filters in Claude Fable 5 — which silently degrade responses for frontier AI research without notifying users — are competitive entrenchment dressed as safety policy.

Controlling the capital after AGI

An analytical piece from Epoch AI taxonomizing post-AGI wealth redistribution proposals — UBI, UBS, UBC, and sovereign wealth funds — along a single axis: how much control over capital, not just income, each scheme grants citizens.

Libraries & Code

comet-ml/opik

An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Papers & Publications

Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

Abstract:

LLM-based Multi-Agent (LLM-MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, hindering practical adoption due to unpredictable costs and environmental impact. To address this, we conduct an analysis of token consumption patterns in an LLM-MA system within the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare token distribution (input, output, reasoning) across these stages.

Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption for an average of 59.4% of tokens. Furthermore, we observe that input tokens consistently constitute the largest share of consumption for an average of 53.9%, providing empirical evidence for potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our novel methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token-efficient agent collaboration protocols.
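
To make the measurement concrete, here is a minimal sketch of how per-stage and per-type token shares could be computed from trace totals. It is not the authors' code: `StageUsage`, `stage_shares`, and `type_shares` are hypothetical names, and the numbers below are placeholders chosen for easy arithmetic, not data from the paper.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class StageUsage:
    """Token counts for one development stage (hypothetical structure)."""
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens + self.reasoning_tokens

def stage_shares(usage: Dict[str, StageUsage]) -> Dict[str, float]:
    """Percentage of all tokens consumed by each stage."""
    grand_total = sum(u.total for u in usage.values())
    if grand_total == 0:
        return {stage: 0.0 for stage in usage}
    return {stage: 100.0 * u.total / grand_total for stage, u in usage.items()}

def type_shares(usage: Dict[str, StageUsage]) -> Dict[str, float]:
    """Percentage of all tokens that are input, output, or reasoning."""
    totals = {
        "input": sum(u.input_tokens for u in usage.values()),
        "output": sum(u.output_tokens for u in usage.values()),
        "reasoning": sum(u.reasoning_tokens for u in usage.values()),
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {kind: 0.0 for kind in totals}
    return {kind: 100.0 * n / grand_total for kind, n in totals.items()}

# Placeholder counts for illustration only (not results from the paper).
example = {
    "Coding": StageUsage(input_tokens=100, output_tokens=50, reasoning_tokens=50),
    "Code Review": StageUsage(input_tokens=500, output_tokens=200, reasoning_tokens=100),
}

for stage, pct in stage_shares(example).items():
    print(f"{stage}: {pct:.1f}% of tokens")
for kind, pct in type_shares(example).items():
    print(f"{kind}: {pct:.1f}% of tokens")
```

Averaging these per-task percentages over many tasks would yield figures of the kind the abstract reports, under the assumption that shares are computed per task and then averaged.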

Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Abstract:

Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent’s ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.

A guest post by

Miko Planas
