Deep Learning Weekly: Issue 455
Interaction Models: A Scalable Approach to Human-AI Collaboration, Hidden Technical Debt of AI Systems: Agent Harness, a paper on Efficient Online Memory for Large Language Models, and many more!
This week in deep learning, we bring you Interaction Models: A Scalable Approach to Human-AI Collaboration, Hidden Technical Debt of AI Systems: Agent Harness, and a paper on δ-mem: Efficient Online Memory for Large Language Models.
You may also enjoy Introducing Perceptron Mk1, Teaching Claude why, a paper on ProgramBench: Can Language Models Rebuild Programs From Scratch?, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing Perceptron Mk1
Perceptron AI launches Mk1, a video and embodied-reasoning vision-language model priced roughly 80–90% below Claude Sonnet 4.5, GPT-5, and Gemini 3.1 Pro.
Notion just turned its workspace into a hub for AI agents
Notion launches its Developer Platform, turning the workspace into an agent orchestration hub with custom-code Workers, external database sync, and native integrations for Claude Code, Cursor, Codex, and Decagon.
Interaction Models: A Scalable Approach to Human-AI Collaboration
Thinking Machines unveils TML-Interaction-Small, a 276B MoE (12B active) interaction model trained from scratch with 200ms time-aligned micro-turns that natively handles concurrent audio, video, and text without VAD-style harnesses.
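The announcement is light on mechanics, so here is a hedged Python sketch of the basic idea behind time-aligned micro-turns: bucketing concurrent audio, video, and text events into fixed 200ms windows on a shared clock. The Event type and to_micro_turns helper are illustrative assumptions, not Thinking Machines' API.

```python
from dataclasses import dataclass

MICRO_TURN_MS = 200  # the post's 200ms micro-turn window

@dataclass
class Event:
    t_ms: int        # wall-clock timestamp in milliseconds (assumed representation)
    modality: str    # "audio" | "video" | "text"
    payload: str     # stand-in for tokens, frames, or samples

def to_micro_turns(events: list[Event]) -> dict[int, list[Event]]:
    """Group a time-ordered multimodal stream into 200ms micro-turns.

    Each micro-turn carries whatever arrived on any modality during its
    window, so the model sees streams aligned on a shared clock instead
    of VAD-segmented speaker turns."""
    turns: dict[int, list[Event]] = {}
    for ev in sorted(events, key=lambda e: e.t_ms):
        turns.setdefault(ev.t_ms // MICRO_TURN_MS, []).append(ev)
    return turns

if __name__ == "__main__":
    stream = [
        Event(30, "audio", "uh-"), Event(120, "video", "frame_0"),
        Event(180, "text", "hello"), Event(250, "audio", "huh?"),
    ]
    for idx, evs in sorted(to_micro_turns(stream).items()):
        print(f"micro-turn {idx} ({idx * MICRO_TURN_MS}ms):",
              [(e.modality, e.payload) for e in evs])
```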
Unsloth Joins the PyTorch Ecosystem
Unsloth joins the PyTorch Ecosystem, a recognition of its open-source contributions, including 2× faster training with 70% less VRAM, FP8 RL for consumer GPUs, and 250M+ model downloads.
MLOps/LLMOps/AgentOps
Hidden Technical Debt of AI Systems: Agent Harness
A Hanchung Lee essay reframing Sculley's 2015 ML technical debt diagram for the agent era, arguing that the agent runtime (harness plus state), not the model, is where most spend, incidents, and architectural debt now accumulate.
Building Blocks for Foundation Model Training and Inference on AWS
A reference guide from Amazon mapping AWS’s four-layer infrastructure stack to foundation model pre-training, post-training, and inference workloads.
How to Eliminate Pipeline Friction in AI Model Serving
A practical NVIDIA guide laying out 18 best practices to eliminate AI model-serving friction across export issues, unsupported ops, dynamic input shapes, and version mismatches.
Learning
Teaching Claude why
An Anthropic post detailing how teaching Claude why actions are aligned, via constitutional documents and ethical reasoning rather than demonstrations alone, drove blackmail rates from 96% (Opus 4) to 0% on every Claude model since Haiku 4.5.
Accelerating Gemma 4: faster inference with multi-token prediction drafters
Google releases Multi-Token Prediction (MTP) drafters for Gemma 4 models, delivering up to 3× faster inference via speculative decoding with zero quality degradation.
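For readers unfamiliar with the mechanism MTP drafters feed into, below is a minimal sketch of speculative decoding with greedy verification. The draft_next and target_next callables are hypothetical stand-ins for a cheap drafter and the full model; real implementations verify all draft tokens in a single batched target pass rather than one call per token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k draft tokens, then accept the longest prefix the target
    model agrees with. With exact greedy verification the output matches
    target-only decoding, which is why quality is unchanged while the
    expensive model runs fewer sequential steps."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)             # cheap drafter proposes the next token
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in draft:
        verified = target_next(ctx)       # batched in practice, serial here
        if verified != tok:
            accepted.append(verified)     # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))     # "bonus" token when all drafts pass
    return accepted

if __name__ == "__main__":
    target = lambda ctx: (sum(ctx) + 1) % 7  # toy deterministic "model"
    agreeing_drafter = target                # always matches the target
    stubborn_drafter = lambda ctx: 6         # usually disagrees
    print(speculative_step([1, 2, 3], agreeing_drafter, target))  # 5 tokens per step
    print(speculative_step([1, 2, 3], stubborn_drafter, target))  # 1 token per step
```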
Vibe coding and agentic engineering are getting closer than I’d like
A post by Simon Willison observing that vibe coding and agentic engineering are converging in his own workflow as he increasingly ships production code from Claude Code without reviewing every line.
How fast is autonomous AI cyber capability advancing?
UK AISI reports that the length of cyber tasks frontier models can autonomously complete is doubling every 4.7 months, down from an 8-month doubling time last November, with Claude Mythos Preview and GPT-5.5 running ahead of even that trend.
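As a back-of-envelope check on what a 4.7-month doubling time implies, the extrapolation is one line of arithmetic; the one-hour baseline horizon below is illustrative, not an AISI figure.

```python
DOUBLING_MONTHS = 4.7  # reported doubling time for autonomous cyber task length

def horizon(h0_hours: float, months_elapsed: float) -> float:
    """Exponential extrapolation: h(t) = h0 * 2**(t / doubling_time)."""
    return h0_hours * 2 ** (months_elapsed / DOUBLING_MONTHS)

# A 4.7-month doubling time compounds to roughly 5.9x growth per year.
print(f"{2 ** (12 / DOUBLING_MONTHS):.1f}x per year")
print(f"a 1h task horizon becomes {horizon(1.0, 12):.1f}h after 12 months")
```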
Reimagining the mouse pointer for the AI era
A design-principles post from Google DeepMind reframing the mouse pointer as a Gemini-powered context-aware partner, built on four principles: maintain the flow, show and tell, embrace “this/that” deixis, and turn pixels into actionable entities.
Full Text Search: Architecture and Design
A technical architecture post from Pinecone introducing full-text search built on Tantivy, delivering Lucene query syntax, BM25 scoring, 18-language tokenization, and 22.7ms p50 latency on 6.4M Wikipedia articles.
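For reference, the BM25 scoring mentioned above reduces to a short formula. Here is a self-contained toy scorer using the textbook Okapi BM25 weighting with the common defaults (k1=1.2, b=0.75); it is not Pinecone's or Tantivy's implementation.

```python
import math

def bm25(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document against a query,
    over a tiny in-memory corpus of tokenized documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N              # average document length
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)              # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc_terms.count(term)                       # term frequency in this doc
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm                              # saturating, length-normalized tf
    return score

docs = [["full", "text", "search"], ["vector", "search"], ["text", "index"]]
print(bm25(["text", "search"], docs[0], docs))           # highest for the matching doc
```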
Libraries & Code
Opik
An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Inspect: A framework for large language model evaluations
An open-source evaluation framework from the UK AI Security Institute, with built-in components for prompt engineering, tool usage, multi-turn dialog, and model-graded evaluations.
Papers & Publications
δ-mem: Efficient Online Memory for Large Language Models
Abstract:
Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose δ-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. δ-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone’s attention computation during generation. With only an 8×8 online memory state, δ-mem improves the average score to 1.10× that of the frozen backbone and 1.15× that of the strongest non-δ-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching 1.31× on MemoryAgentBench and 1.20× on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
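The abstract's core mechanism, a fixed-size associative state updated by the delta rule, fits in a few lines. The sketch below uses an 8×8 state like the paper's smallest configuration, but the shapes, learning rate, and attention-correction hook are assumptions for illustration, not the paper's released code.

```python
import numpy as np

d = 8                                   # key/value dimension of the memory
S = np.zeros((d, d))                    # fixed-size online state matrix

def delta_update(S, k, v, lr=0.5):
    """Delta rule: nudge S so its readout for key k moves toward value v.
    S <- S + lr * (v - S k) k^T, so storage stays O(d^2) no matter how
    much history has been written."""
    pred = S @ k                        # what the memory currently returns for k
    return S + lr * np.outer(v - pred, k)

def readout(S, q):
    """Query the compressed memory; in delta-mem this readout drives a
    low-rank correction to the frozen backbone's attention output."""
    return S @ q

rng = np.random.default_rng(0)
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                  # unit key makes convergence easy to see
v = rng.standard_normal(d)
for _ in range(20):                     # repeated writes converge: S k -> v
    S = delta_update(S, k, v)
print(np.allclose(readout(S, k), v, atol=1e-3))   # True
```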
ProgramBench: Can Language Models Rebuild Programs From Scratch?
Abstract:
Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
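The evaluation idea, checking a rebuilt program's observable behavior against the reference executable on generated inputs, amounts to differential testing. A hedged sketch follows; the random-bytes input generator and helper names are illustrative stand-ins, whereas ProgramBench derives its tests from agent-driven fuzzing.

```python
import random
import string
import subprocess

def run(binary: str, stdin_text: str) -> tuple[int, str]:
    """Execute a binary on one input, capturing exit code and stdout."""
    proc = subprocess.run([binary], input=stdin_text, text=True,
                          capture_output=True, timeout=10)
    return proc.returncode, proc.stdout

def behavioral_match(reference: str, candidate: str, trials: int = 100) -> float:
    """Fraction of generated inputs on which the candidate's observable
    behavior (exit code and stdout) matches the reference executable's,
    without ever inspecting either implementation."""
    passed = 0
    for _ in range(trials):
        stdin_text = "".join(random.choices(string.printable, k=64))
        if run(reference, stdin_text) == run(candidate, stdin_text):
            passed += 1
    return passed / trials

# e.g. behavioral_match("./reference_tool", "./agent_built_tool")
```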

