Deep Learning Weekly: Issue 450
Gemma 4, Components of A Coding Agent, a paper on VOID: Video Object and Interaction Deletion, and many more!
This week in deep learning, we bring you Gemma 4, Components of A Coding Agent, and a paper on VOID: Video Object and Interaction Deletion.
You may also enjoy Claude Managed Agents, Evaluating alignment of behavioral dispositions in LLMs, a paper on TriAttention: Efficient Long Reasoning with Trigonometric KV Compression, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Qwen: Qwen3.6-Plus: Towards Real World Agents
Alibaba launches Qwen3.6-Plus, a frontier agentic coding model that matches or beats Claude Opus 4.5 on SWE-bench and Terminal-Bench 2.0.
Claude Managed Agents: get to production 10x faster
Anthropic launches Claude Managed Agents in public beta — a suite of composable, cloud-hosted agent APIs that abstract away sandboxing, state management, permissioning, and orchestration, enabling teams to ship production agents in days instead of months.
Gemma 4: Our most capable open models to date
Google releases Gemma 4, a family of four open models, with the 31B variant ranking #3 among open models on Arena AI and outperforming models 20x its size.
Modus secures $85M to expand AI-powered audit and accounting partnerships
Modus Audit raises $85M to deploy AI across audit and accounting firm workflows.
Ollama is now powered by MLX on Apple Silicon in preview
Ollama 0.19 launches MLX-powered inference on Apple Silicon, delivering ~2x gains in prefill and decode speed on M5 chips, with NVFP4 quantization support and smarter KV cache reuse for agentic workloads.
MLOps/LLMOps
AI Agent Evaluation: Building Reliable Systems Beyond Simple Testing
A practical guide on why standard LLM evaluation breaks for agentic systems, covering compounding failures, process vs. outcome metrics, multi-turn state tracking, and the trace-evaluate-optimize loop needed for production agents.
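The guide's process-vs-outcome distinction can be sketched in a few lines: an outcome metric checks only the final answer, while a process metric scores the intermediate tool calls in the trace. The trace format and helper names below are my own illustration, not from the article:

```python
# Toy agent trace: a list of steps, each either a tool call or a final answer.
trace = [
    {"type": "tool_call", "tool": "search", "args": {"query": "2023 revenue"}},
    {"type": "tool_call", "tool": "calculator", "args": {"expr": "1.2e9 * 0.1"}},
    {"type": "final_answer", "content": "120000000.0"},
]

def outcome_metric(trace, expected):
    # Outcome-only: did the agent end with the right answer?
    final = next(s for s in trace if s["type"] == "final_answer")
    return float(final["content"] == expected)

def process_metric(trace, required_tools):
    # Process-aware: what fraction of the required tools did the agent use?
    used = {s["tool"] for s in trace if s["type"] == "tool_call"}
    return len(used & set(required_tools)) / len(required_tools)

print(outcome_metric(trace, "120000000.0"))             # 1.0
print(process_metric(trace, ["search", "calculator"]))  # 1.0
```

An agent can score 1.0 on the outcome metric while skipping required tools (or vice versa), which is why production evaluation typically tracks both over full traces.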
Simulate realistic users to evaluate multi-turn AI agents in Strands Evals
A technical blog about ActorSimulator in AWS’s Strands Evals SDK, which generates persona-consistent, goal-driven simulated users to automate multi-turn agent evaluation at scale.
Learning
Components of A Coding Agent
A breakdown by Sebastian Raschka of the six architectural components that make coding agents (Claude Code, Codex CLI) meaningfully more capable than raw LLMs in a chat UI.
Quantization from the ground up
A highly interactive, ground-up explainer on LLM quantization covering floating point formats, symmetric vs. asymmetric compression, outlier handling, and empirical quality/speed tradeoffs.
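The symmetric-vs-asymmetric distinction the explainer covers can be shown with a minimal int8 example (helper names are my own; NumPy only): symmetric quantization fixes the zero point at 0, while asymmetric quantization maps the full [min, max] range, which handles skewed distributions better.

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    # Scale maps the largest magnitude to the int range edge; zero point is
    # fixed at 0, which wastes range when the distribution is skewed.
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(x, bits=8):
    # Scale and zero point map [min, max] onto the full uint range.
    qmax = 2 ** bits - 1                            # 255 for uint8
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float32) + 2.0  # skewed distribution
q_s, s = quantize_symmetric(x)
x_s = q_s.astype(np.float32) * s                    # dequantize
q_a, s_a, zp = quantize_asymmetric(x)
x_a = (q_a.astype(np.float32) - zp) * s_a           # dequantize
print("symmetric MSE: ", np.mean((x - x_s) ** 2))
print("asymmetric MSE:", np.mean((x - x_a) ** 2))
```

On this skewed input the asymmetric scheme spends its 256 levels on the actual data range and reconstructs with lower error, which is the empirical tradeoff the explainer walks through.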
Evaluating alignment of behavioral dispositions in LLMs
A blog post on evaluating behavioral alignment across 25 LLMs, finding frontier models hit ~80–83% alignment with human consensus but are systematically overconfident in ambiguous scenarios and inconsistent between self-reported and revealed behavior.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A self-improving AI agent built by Nous Research, featuring a built-in learning loop that lets it update its behavior from experience.
Papers & Publications
VOID: Video Object and Interaction Deletion
Abstract:
Existing video object removal methods excel at inpainting content “behind” the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Abstract:
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, leaving very few representative queries, which causes poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
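The general shape of position-aware key scoring can be illustrated with a toy sketch: score each cached key by a trigonometric function of its distance from the current position combined with its norm, then retain only the top-k. This is a rough illustration of the idea, not the paper's actual method; the `center` parameter and the scoring formula are placeholders for the Q/K concentration centers TriAttention derives.

```python
import numpy as np

def score_keys(keys, positions, cur_pos, center=0.5, keep=4):
    # Toy stand-in for trigonometric KV scoring: a cosine distance-preference
    # term times the key norms, then keep the top-`keep` keys. All names and
    # formulas here are illustrative, not the paper's.
    dist = cur_pos - positions                  # distance of each cached key
    trig = np.cos(center * dist)                # distance-preference term
    norms = np.linalg.norm(keys, axis=-1)       # key-magnitude signal
    scores = trig * norms
    kept = np.argsort(scores)[::-1][:keep]      # indices of retained keys
    return np.sort(kept)

keys = np.random.default_rng(0).normal(size=(16, 8))  # 16 cached keys, dim 8
positions = np.arange(16)
kept = score_keys(keys, positions, cur_pos=16, keep=4)
print(kept)  # positions of the 4 keys retained in the compressed cache
```

The point of scoring by position rather than by recent post-RoPE attention is that it stays stable as generation advances, which is what the paper attributes its accuracy retention to.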