Deep Learning Weekly: Issue 447
Mamba-3, Agent-native Architectures: How to Build Apps After Code Ends, a paper on Attention Residuals, and many more!
This week in deep learning, we bring you Mamba-3, Agent-native Architectures: How to Build Apps After Code Ends and a paper on Attention Residuals.
You may also enjoy Introducing Mistral Small 4, State of RL for reasoning LLMs, a paper on Data Agents: Levels, State of the Art, and Open Problems, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Together AI releases Mamba-3, an inference-first state space model that outperforms Mamba-2, Gated DeltaNet, and Transformer-based Llama-3.2-1B on end-to-end latency at the 1.5B scale.
Mistral releases Small 4 — an open-source, 119B-parameter MoE model unifying reasoning, multimodal, and coding capabilities, delivering 40% lower latency and 3x higher throughput than its predecessor.
Claude builds interactive visuals right in your conversation
Anthropic launches inline interactive charts, diagrams, and visualizations in Claude chat — available in beta across all plan types.
Measuring Progress Towards AGI: A Cognitive Framework
Google DeepMind releases a cognitive taxonomy paper proposing 10 human-grounded abilities to measure AGI progress, paired with a $200,000 Kaggle hackathon to crowdsource the missing benchmarks.
Gumloop reels in $50M for its AI automation platform
Gumloop raises $50M Series B led by Benchmark — with participation from Shopify Ventures and Y Combinator — bringing total funding to $70M for its no-code, drag-and-drop AI agent automation platform.
Okta unveils new framework to manage AI agents and upcoming Okta for AI Agents platform
Okta unveils a security blueprint for the agentic enterprise and announces its “Okta for AI Agents” platform – treating AI agents as governed, non-human identities with centralized access control and a kill switch for rogue agents.
MLOps/LLMOps
Agent-native Architectures: How to Build Apps After Code Ends
A technical guide on building agent-native applications — software architectures where agents are first-class citizens, using atomic tools and outcome-driven loops instead of hardcoded workflows.
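The core pattern the guide describes — small atomic tools composed by a planner until an outcome predicate holds, rather than a hardcoded workflow — can be sketched in a few lines. The tool names, the toy planner, and the outcome check below are all illustrative stand-ins (in a real agent-native app an LLM plays the planner role):

```python
from typing import Callable

# Atomic tools: small, single-purpose operations the agent can compose freely.
def add_item(state: dict, item: str) -> dict:
    state["items"].append(item)
    return state

def sort_items(state: dict, _: str) -> dict:
    state["items"].sort()
    return state

TOOLS: dict[str, Callable[[dict, str], dict]] = {
    "add_item": add_item,
    "sort_items": sort_items,
}

def outcome_met(state: dict, goal: list[str]) -> bool:
    # Outcome-driven: success is a property of the state, not a fixed script.
    return state["items"] == goal

def agent_loop(plan_step, goal, max_steps=10):
    """Call planner-chosen tools until the desired outcome holds."""
    state = {"items": []}
    for _ in range(max_steps):
        if outcome_met(state, goal):
            return state
        tool_name, arg = plan_step(state, goal)  # an LLM would decide this
        state = TOOLS[tool_name](state, arg)
    return state

# Stand-in planner: picks whichever tool call moves the state toward the goal.
def toy_planner(state, goal):
    missing = [g for g in goal if g not in state["items"]]
    if missing:
        return "add_item", missing[0]
    return "sort_items", ""

result = agent_loop(toy_planner, goal=["alpha", "beta"])
```

The point of the sketch is the inversion of control: the loop never encodes *which* tools run in *what* order, only what "done" looks like.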
Learning
State of RL for reasoning LLMs
A technical deep-dive surveying the evolution of reinforcement learning algorithms for reasoning LLMs (2024–2026), tracing the lineage from REINFORCE and PPO through GRPO and eight successor methods.
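One of GRPO's key departures from PPO, which the survey traces, is dropping the learned value/critic network: advantages are computed relative to a group of completions sampled for the same prompt. A minimal sketch of that group-relative normalization (implementations differ in details such as the std convention):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages in the spirit of GRPO: each sampled
    completion's reward is normalized against the group mean and std,
    so no separate value network is needed (unlike PPO)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Rewards for a group of 4 completions sampled from the same prompt.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat the group average get positive advantage and are reinforced; the rest are pushed down.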
Many SWE-bench-Passing PRs Would Not Be Merged into Main - METR
METR researchers found that roughly half of SWE-bench Verified PRs that pass the automated grader would not actually be merged by real repo maintainers, with automated grader scores averaging 24 percentage points higher than maintainer merge rates.
LumberChunker: Long-Form Narrative Document Segmentation
An article about LumberChunker, a RAG chunking method that uses an LLM to detect semantic boundaries in long-form narrative documents, achieving DCG@20 of 62.1% on the GutenQA benchmark — outperforming all fixed-size and recursive baselines.
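The mechanism — grow a window of consecutive passages and ask an LLM where the content shifts, then cut the chunk at that boundary — can be sketched as follows. The `detect_shift` callable stands in for the LLM query in the paper; the toy detector and window size here are illustrative, not the article's actual prompt or parameters:

```python
def chunk_by_boundaries(paragraphs, detect_shift, max_window=4):
    """LumberChunker-style segmentation sketch: scan a window of paragraphs,
    ask a model for the index where the topic changes, and cut there."""
    chunks, i = [], 0
    while i < len(paragraphs):
        window = paragraphs[i:i + max_window]
        cut = detect_shift(window)  # index of first off-topic paragraph, or None
        if cut is None or cut <= 0:
            cut = len(window)       # window is coherent: keep it as one chunk
        chunks.append(window[:cut])
        i += cut
    return chunks

# Toy boundary detector keyed on a topic tag (an LLM call in the real method).
def toy_detect(window):
    first = window[0].split(":")[0]
    for k, p in enumerate(window):
        if p.split(":")[0] != first:
            return k
    return None

docs = ["A: intro", "A: details", "B: new topic", "B: more", "C: end"]
chunks = chunk_by_boundaries(docs, toy_detect)
```

Unlike fixed-size chunking, the resulting chunk lengths adapt to where the narrative actually changes.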
VAGEN: Teaching Vision-Language Models to Build World Models Through Reinforcement Learning
A Stanford AI Lab research blog post about VAGEN, a reinforcement learning framework that trains 3B-parameter VLM agents to build internal world models via structured state estimation and transition predictions.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A mini CLI search engine for your docs, knowledge bases, meeting notes, and more.
Papers & Publications
Attention Residuals
Abstract:
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer’s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.
Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
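The replacement the abstract describes — swapping the fixed unit-weight sum over preceding layer outputs for softmax attention with input-dependent weights — can be illustrated with a toy, dependency-free sketch. Using the latest layer output directly as the query is a simplification of this sketch; the paper's actual parameterization is learned:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attn_residual(layer_outputs):
    """Toy AttnRes sketch: instead of summing all preceding layer outputs
    with fixed unit weights (standard PreNorm residual), attend over them
    with input-dependent softmax weights."""
    d = len(layer_outputs[0])
    q = layer_outputs[-1]  # simplification: query = current representation
    # One attention score per earlier layer output (scaled dot product).
    scores = [sum(qi * hi for qi, hi in zip(q, h)) / math.sqrt(d)
              for h in layer_outputs]
    w = softmax(scores)
    # Weighted aggregate replaces the uniform residual sum, so later layers
    # can emphasize informative earlier representations instead of diluting.
    return [sum(wl * h[j] for wl, h in zip(w, layer_outputs))
            for j in range(d)]

agg = attn_residual([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the weights sum to one, the aggregate's magnitude stays bounded by the layer outputs themselves, which is the intuition behind the paper's claim of curbing hidden-state growth with depth.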
Data Agents: Levels, State of the Art, and Open Problems
Abstract:
Data agents are an emerging paradigm that leverages large language models (LLMs) and tool-using agents to automate data management, preparation, and analysis tasks. However, the term “data agent” is currently used inconsistently, conflating simple query-responsive assistants with aspirational, fully autonomous “data scientists”. This ambiguity blurs capability boundaries and accountability, making it difficult for users, system builders, and regulators to reason about what a “data agent” can and cannot do.
In this tutorial, we propose the first hierarchical taxonomy of data agents from Level 0 (L0, no autonomy) to Level 5 (L5, full autonomy). Building on this taxonomy, we will introduce a lifecycle- and level-driven view of data agents. We will (1) present the L0-L5 taxonomy and the key evolutionary leaps that separate simple assistants from truly autonomous data agents, (2) review representative L0-L2 systems across data management, preparation, and analysis, (3) highlight emerging Proto-L3 systems that strive to autonomously orchestrate end-to-end data workflows to tackle diverse and comprehensive data-related tasks under supervision, and (4) discuss forward-looking research challenges towards proactive (L4) and generative (L5) data agents. We aim to offer both a practical map of today’s systems and a research roadmap for the next decade of data-agent development.
Abstract:
Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist’s executive capability, while enhancing an AI’s scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year tests, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
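Training a judge on preferred-vs-dispreferred pairs, as the abstract describes for high- vs. low-citation papers, is commonly done with a Bradley-Terry-style pairwise objective. The abstract does not specify the exact loss, so the following is a common-assumption sketch, not the paper's implementation:

```python
import math

def pairwise_preference_loss(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry-style pairwise loss for a judge/reward model:
    -log sigmoid(s_pos - s_neg). Driving it down widens the margin by
    which the preferred item (e.g. the high-citation paper) outscores
    the dispreferred one."""
    margin = score_pos - score_neg
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin between the pair's scores yields a smaller loss.
loss_tied = pairwise_preference_loss(0.0, 0.0)   # no preference learned yet
loss_good = pairwise_preference_loss(2.0, 0.0)   # judge prefers the right item
```

The trained judge's scalar score can then serve directly as the reward model for the alignment stage.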

