Deep Learning Weekly: Issue 460
GLM-5.2, Understanding Your Claude Code Spend: What’s Actually Driving the Cost, a paper on Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories, and many more!
This week in deep learning, we bring you GLM-5.2, Understanding Your Claude Code Spend: What’s Actually Driving the Cost, and a paper on Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories.
You may also enjoy Kimi K2.7 Code, M*: A Modular, Extensible, Serving System for Multimodal Models, a paper on FastContext: Training Efficient Repository Explorer for Coding Agents, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Kimi K2.7 Code: Open-Source Agentic Coding Model
Moonshot AI launches Kimi K2.7 Code, an open-source 1T-parameter MoE coding model gaining up to 31.5% on benchmarks while cutting thinking-token usage ~30% versus K2.6.
MolmoMotion: Language-guided 3D motion forecasting
Ai2 releases MolmoMotion, a language-guided model that forecasts object 3D point trajectories from video, beating prior methods on motion forecasting, robot planning, and video generation.
Stanford’s DeLM cuts multi-agent task costs 50% — without a central orchestrator
Stanford’s DeLM replaces central orchestrators with shared verified context, beating the strongest baseline by 10.5% on SWE-bench Verified at roughly half the cost.
GLM-5.2: Built for Long-Horizon Tasks
Z.ai launches GLM-5.2, an open-source 753B-parameter MoE model with a stable 1M-token context, beating GLM-5.1 by wide margins on Terminal-Bench 2.1 and SWE-bench Pro while trailing Claude Opus 4.8 by just 1% on FrontierSWE.
MLOps/LLMOps/AgentOps
Hidden Technical Debt of AI Systems: Agent Evaluation Infrastructure
A detailed guide arguing agent evaluation requires a control-plane/data-plane system spanning traces, state deltas, checkpoints, and replay — not just a single benchmark score.
M*: A Modular, Extensible, Serving System for Multimodal Models
Stanford’s M* replaces vLLM/SGLang’s single autoregressive loop with a generic “Walk Graph,” beating specialized serving systems by up to 2.7x on speech, 2.6x on image editing, and 12.5x on world-model rollouts.
Learning
Understanding Your Claude Code Spend: What’s Actually Driving the Cost
Long and inefficient context can quietly drive up costs and reduce performance in Claude Code. Learn five techniques for auditing context usage and eliminating unnecessary token overhead.
How Evaluation-Driven Development (EDD) Works
Learn how retrieval, memory, and context management have emerged as critical infrastructure for AI agents, often having a greater impact on system performance than the underlying model itself.
Pre-Training Isn’t Bitter Enough
CMU proposes V-pretraining, which uses a small labeled feedback set to train a task designer that shapes self-supervised targets, lifting Qwen2.5-0.5B’s GSM8K Pass@1 from 22.20 to 29.60 without directly supervising the learner.
First Steps Toward Automated AI Research
Recursive’s automated AI research system beats human-optimized SOTA on three benchmarks: 0.0263 lower BPB on fixed-budget LM training, 2.2s faster on NanoGPT Speedrun, and an 18% gap reduction on GPU kernel optimization.
Three Ways Codex Can Use a Computer
A practical guide on routing OpenAI Codex tasks across Computer Use, Chrome, and the in-app browser based on whether the job needs native apps, signed-in sites, or public-page review.
Portable vLLM Model Inference Kernels in Helion
Red Hat and Meta integrate Helion’s PyTorch-native kernel DSL into vLLM for FP8 Qwen3 inference, beating existing CUDA/TorchInductor kernels on most ops while still trailing CUTLASS on GEMM for Blackwell GPUs.
Libraries & Code
An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Use Codex from Claude Code to review code or delegate tasks.
Papers & Publications
Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
Abstract:
Data tells stories that shape society; the data journalist’s job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting.
FastContext: Training Efficient Repository Explorer for Coding Agents
Abstract:
Large Language Model (LLM) coding agents have achieved strong results on software engineering tasks, yet repository exploration remains a major bottleneck: locating relevant code consumes substantial token budget and pollutes the agent’s context with irrelevant snippets. In most agents, the same model explores the repository and solves the task, leaving exploratory reads and searches in the solver’s history. We present FastContext, a dedicated exploration subagent that separates repository exploration from solving. Invoked on demand, FastContext issues parallel tool calls and returns concise file paths and line ranges as focused context. FastContext is powered by specialized exploration models spanning 4B–30B parameters. We bootstrap them from strong reference-model trajectories and refine them with task-grounded rewards for broad first-turn search, multi-turn evidence gathering, and precise citation generation. Across SWE-bench Multilingual, SWE-bench Pro, and SWE-QA, integrating FastContext into Mini-SWE-Agent improves end-to-end resolution rates up to 5.5% while reducing coding-agent token consumption up to 60%, with marginal overhead. These results show that repository exploration can be separated from solving and handled effectively by specialized models.
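To make the "file paths and line ranges as focused context" idea concrete, here is a minimal sketch of how a solver might consume such output. It is not FastContext's actual interface: the `Citation` record and the `render_focused_context` helper are hypothetical names chosen for illustration, and the line ranges are assumed to be 1-indexed and inclusive.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Citation:
    """Hypothetical record: a repo-relative path plus an inclusive, 1-indexed line range."""
    path: str
    start: int
    end: int


def render_focused_context(repo_root: Path, citations: list[Citation]) -> str:
    """Concatenate only the cited line ranges, each under a 'path:start-end' header."""
    chunks = []
    for c in citations:
        lines = (repo_root / c.path).read_text(encoding="utf-8").splitlines()
        # Convert the 1-indexed inclusive range into a Python slice.
        snippet = lines[c.start - 1 : c.end]
        chunks.append(f"# {c.path}:{c.start}-{c.end}\n" + "\n".join(snippet))
    return "\n\n".join(chunks)
```

In this sketch the solver's prompt would receive only the rendered snippets rather than the full history of exploratory reads and searches, which is the separation the abstract describes.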


