Deep Learning Weekly: Issue 451
The 2026 AI Index Report, MirrorCode: Evidence that AI can already do some weeks-long coding tasks, a paper on Introspective Diffusion Language Models, and many more!
This week in deep learning, we bring you The 2026 AI Index Report, MirrorCode: Evidence that AI can already do some weeks-long coding tasks, and a paper on Introspective Diffusion Language Models.
You may also enjoy Gemini Robotics ER 1.6: Enhanced Embodied Reasoning, Should AI Step Aside?: Teaching Agents When Humans Want to Intervene, a paper on CodeTracer: Towards Traceable Agent States, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
The 2026 AI Index Report | Stanford HAI
Stanford HAI releases the 2026 AI Index — a 400+ page annual report tracking AI’s technical performance, investment, labor market effects, policy landscape, and public sentiment across nine chapters.
Gemini Robotics ER 1.6: Enhanced Embodied Reasoning
Google DeepMind releases Gemini Robotics-ER 1.6, a robotics-specialized reasoning model with upgraded spatial reasoning, multi-view success detection, and instrument reading (93% accuracy with agentic vision).
Introducing Muse Spark: Scaling Towards Personal Superintelligence
Meta’s Superintelligence Labs launches Muse Spark — a natively multimodal reasoning model with multi-agent “Contemplating” mode that achieves 58% on Humanity’s Last Exam.
Gemini 3.1 Flash TTS: the next generation of expressive AI speech
Google launches Gemini 3.1 Flash TTS, a text-to-speech model with natural-language audio tags for granular vocal control across 70+ languages, available via Gemini API, Vertex AI, and Google Vids.
Introducing MAI-Image-2-Efficient: Faster, More Efficient Image Generation
Microsoft releases MAI-Image-2-Efficient — 22% faster and 4x more GPU-efficient than MAI-Image-2, targeting high-volume and real-time image generation workloads.
Introducing routines in Claude Code
Anthropic launches Routines in Claude Code — serverless automations triggered by schedule, API call, or GitHub webhook events, with daily limits of 5–25 runs depending on plan tier.
MLOps/LLMOps
Multimodal LLM Evaluation: A Developer’s Guide to Multimodal Language Models
A guide to evaluating multimodal LLMs, highlighting why text-only metrics fall short for image, audio, and video inputs, while outlining methods for grounding outputs and using LLM-based evaluation to measure real-world performance.
Learning
Should AI Step Aside?: Teaching Agents When Humans Want to Intervene
A research blog post introducing CowCorpus and PlowPilot — a dataset and intervention-aware web agent system that predicts when users want to take over, yielding a 26.5% improvement in user-rated usefulness over a fully autonomous baseline.
MirrorCode: Evidence that AI can already do some weeks-long coding tasks
A research report from Epoch AI introducing MirrorCode, a long-horizon coding benchmark, showing Claude Opus 4.6 can autonomously reimplement a 16,000-line bioinformatics toolkit estimated to take a human engineer 2–17 weeks.
8 Tips for Writing Agent Skills
A practical guide on authoring effective agent skills, covering description precision, instruction conciseness, layered context loading, and when to retire skills as model capabilities advance.
The AI Revolution in Math Has Arrived
A Quanta Magazine feature documenting how AI has become a genuine research accelerator, with mathematicians using it to discover and prove new results in days rather than months.
Unsloth’s technical guide for fine-tuning Google’s Gemma 4 family covering VRAM requirements, critical bug fixes for KV-sharing and gradient accumulation, and recipes for SFT, vision, audio, and GRPO training.
Towards developing future-ready skills with generative AI
A Google blog post introducing Vantage, a GenAI-powered assessment platform that places students in AI-simulated multi-party conversations to measure “future-ready” skills.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A pattern for building personal knowledge bases using LLMs.
Papers & Publications
Introspective Diffusion Language Models
Abstract:
Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce the Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand for large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.
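The introspective acceptance rate defined in the abstract can be sketched concretely. This is a minimal illustration under our own assumptions: we treat "accepting" a token as the model re-predicting it as the top-1 choice when re-scoring its earlier output; the paper may use a different acceptance criterion.

```python
def introspective_acceptance_rate(generated, rescored_argmax):
    """Fraction of a model's own earlier tokens that it would accept
    when re-scoring them on a second pass.

    generated       -- tokens the model produced on the first pass
    rescored_argmax -- the model's top-1 prediction at each of those
                       positions when re-scoring its own output
    Acceptance-as-argmax-agreement is an assumption of this sketch.
    """
    assert len(generated) == len(rescored_argmax)
    if not generated:
        return 0.0
    agree = sum(g == r for g, r in zip(generated, rescored_argmax))
    return agree / len(generated)
```

An AR model trained with causal masking trivially scores near 1.0 on this metric over its own greedy generations, which is the structural advantage the abstract points to.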
CodeTracer: Towards Traceable Agent States
Abstract:
Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.
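The "failure onset localization" idea over a hierarchical trace tree can be sketched in a few lines. The node layout (`name`, `ok`, `children`) and the depth-first search below are our assumptions for illustration, not CodeTracer's actual data model or algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class TraceNode:
    """One node of a hypothetical hierarchical trace tree: a run stage
    or step, with a success flag and child steps."""
    name: str
    ok: bool = True
    children: list = field(default_factory=list)

def locate_failure_onset(node):
    """Return the deepest failing node along the first failing path in
    pre-order, i.e. a candidate root of the downstream error chain,
    or None if the subtree succeeded."""
    if not node.ok:
        # Prefer a failing descendant: it localizes the onset more
        # precisely than the failing parent stage.
        for child in node.children:
            deeper = locate_failure_onset(child)
            if deeper is not None:
                return deeper
        return node
    for child in node.children:
        found = locate_failure_onset(child)
        if found is not None:
            return found
    return None
```

The point of the sketch: once a run is reconstructed as a tree of stages and steps, pinpointing where the hidden error chain starts becomes a simple traversal rather than manual log inspection.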