Deep Learning Weekly: Issue 436
GPT-5.2-Codex, Prompt Drift: The Hidden Failure Mode Undermining Agentic Systems, a paper on Adaptation of Agentic AI, and many more!
This week in deep learning, we bring you GPT-5.2-Codex, Prompt Drift: The Hidden Failure Mode Undermining Agentic Systems, and a paper on Adaptation of Agentic AI.
You may also enjoy Mistral OCR 3, 2025 LLM Year in Review | karpathy, a paper on From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing GPT-5.2-Codex | OpenAI
An introduction to GPT-5.2-Codex, OpenAI’s most advanced agentic coding model optimized for complex software engineering and defensive cybersecurity.
Mistral OCR 3 | Mistral AI
An introduction to Mistral OCR 3, which achieves a 74% win rate over its predecessor with state-of-the-art accuracy on forms, handwriting, and complex tables.
Runway announces a real-time General World Model family with three variants for explorable environments, interactive characters, and robotic manipulation.
Meta Platforms buys Manus to bolster its agentic AI skillset
Meta acquires Singapore-based Manus, a general-purpose AI agent that reached $100M ARR in just eight months.
MLOps & LLMOps
Prompt Drift: The Hidden Failure Mode Undermining Agentic Systems
A blog post explaining prompt drift and how it undermines multi-step agentic systems through subtle reasoning degradation rather than clean failures.
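Below is a minimal, hypothetical sketch (not from the post) of how drift can be surfaced: track how far the agent's working context has wandered from the original task at each step, and re-anchor when it exceeds a threshold.

```python
# Hypothetical illustration of prompt drift: as an agent loop rewrites its
# own working context, similarity to the original task decays step by step.
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

task = "summarize the quarterly sales report and flag anomalies"
context = task
for step in range(1, 6):
    # Stand-in for an LLM call that re-summarizes its own context each step;
    # real systems degrade more subtly, but the monitoring idea is the same.
    context = f"step {step}: continue work based on: {context[:40]}"
    drift = 1.0 - cosine_sim(task, context)
    print(f"step {step}: drift={drift:.2f}")
    if drift > 0.8:
        print("drift threshold exceeded; re-anchor to the original task")
```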
We removed 80% of our agent’s tools
A case study about how Vercel simplified their internal text-to-SQL agent (d0) by removing 80% of specialized tools and replacing them with a single bash command execution tool.
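The post doesn't publish d0's tool definition, but a single general-purpose command-execution tool might look roughly like this sketch (names, schema, and the missing sandboxing are all assumptions):

```python
# Hypothetical sketch of a single bash-execution tool replacing many
# specialized ones; the agent composes psql, grep, jq, etc. itself.
import subprocess

BASH_TOOL = {
    "name": "run_bash",
    "description": "Run a shell command and return stdout/stderr.",
    "parameters": {"command": {"type": "string"}},
}

def run_bash(command: str, timeout: int = 30) -> str:
    """Execute a command in a shell (sandboxing omitted for brevity)."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout if result.returncode == 0 else f"error: {result.stderr}"

# e.g. the model emits: run_bash("psql -c 'SELECT count(*) FROM users;'")
```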
Agents Meet Databases: The Future of Agentic Architectures
A MongoDB article exploring two architectural paths for connecting AI agents to databases: standardized MCP servers versus custom LangChain integrations, with emphasis on accuracy, security, and performance trade-offs.
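As a toy illustration of the security side of a custom integration (not MongoDB's or LangChain's API), a database tool handed to an agent can be constrained to read-only, allowlisted queries:

```python
# Hypothetical read-only query tool; the naive string checks below stand in
# for the real SQL parsing and policy enforcement a production system needs.
import sqlite3

ALLOWED_TABLES = {"orders", "customers"}  # assumed allowlist

def read_only_query(sql: str, db_path: str = "app.db") -> list[tuple]:
    """Reject anything but SELECTs touching allowlisted tables, then run."""
    lowered = sql.strip().lower()
    if not lowered.startswith("select"):
        raise ValueError("only SELECT statements are permitted")
    if not any(t in lowered for t in ALLOWED_TABLES):
        raise ValueError("query must target an allowlisted table")
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)  # read-only
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```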
Learning
2025 LLM Year in Review | karpathy
A technical retrospective by Andrej Karpathy identifying six paradigm shifts in LLMs during 2025, including the rise of reinforcement learning from verifiable rewards, the emergence of “vibe coding,” and new AI interaction paradigms like Claude Code.
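One of those shifts, RL from verifiable rewards, boils down to scoring outputs with a deterministic checker rather than a learned reward model. A toy example (the answer format and extraction regex are assumptions):

```python
# Toy verifiable reward for math answers: a deterministic checker scores
# the model's output instead of a learned reward model.
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """1.0 if the final number in the output matches the answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

print(verifiable_reward("The total is 42.", "42"))   # 1.0
print(verifiable_reward("Roughly 40 or so.", "42"))  # 0.0
```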
Measuring no CoT math time horizon (single forward pass)
A research article measuring how well AI models can solve math problems without chain-of-thought reasoning, answering in effectively a single forward pass.
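The harness for this kind of evaluation is simple in outline: forbid intermediate reasoning in the prompt and cap the token budget. A hedged sketch, where `generate` is a hypothetical stand-in for whatever inference API you use:

```python
# Sketch of a no-CoT evaluation: the prompt forbids working steps and the
# small token budget approximates answering in a single forward pass.
PROMPT = "Answer with only the final number, no working: {question}"

def eval_no_cot(generate, problems: list[tuple[str, str]]) -> float:
    """Fraction of (question, answer) pairs answered correctly with no CoT."""
    correct = 0
    for question, answer in problems:
        out = generate(PROMPT.format(question=question), max_new_tokens=8)
        correct += out.strip() == answer
    return correct / len(problems)

# e.g. eval_no_cot(my_model_generate, [("17 * 24 = ?", "408")])
```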
Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate
A technical deep dive introducing NVIDIA Nemotron 3’s hybrid Mamba-Transformer MoE architecture with native 1M-token context and multi-environment RL training.
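Nemotron 3's implementation is far more involved, but the MoE half of that hybrid rests on a familiar idea: token-level top-k expert routing. A generic PyTorch sketch (dimensions and layout are assumptions, not NVIDIA's code):

```python
# Generic top-2 mixture-of-experts routing sketch (not Nemotron's code).
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)  # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

y = TopKMoE()(torch.randn(10, 512))  # toy usage
```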
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
An open-source toolkit that lets you focus on product scenarios and predictable outcomes instead of vibe coding every piece from scratch.
Papers & Publications
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence
Abstract:
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) to code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capabilities of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Finally, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
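As a generic illustration of the scaling-law fitting such experiments involve (synthetic data, not the paper's results), a power-law loss curve L(N) = a·N^(-b) + c can be fit in a few lines:

```python
# Fit a generic power-law scaling curve L(N) = a * N**(-b) + c.
# Synthetic data for illustration only; not numbers from the survey.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    return a * n ** (-b) + c

true_params = (50.0, 0.3, 1.5)  # "true" a, b, c generating the toy data
n_params = np.array([1e7, 1e8, 1e9, 1e10, 1e11])
loss = scaling_law(n_params, *true_params) \
    + np.random.default_rng(0).normal(0, 0.01, 5)

(a, b, c), _ = curve_fit(scaling_law, n_params, loss,
                         p0=[10, 0.5, 1.0], maxfev=10000)
print(f"fitted: L(N) = {a:.1f} * N^(-{b:.3f}) + {c:.2f}")
```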
Adaptation of Agentic AI
Abstract:
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
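As a rough, hypothetical illustration of the paper's tool-execution-signaled category (the bandit-style update rule below is my assumption, not the authors' algorithm): an agent can adapt its tool-selection preferences from execution outcomes alone.

```python
# Toy "tool-execution-signaled" adaptation: tool-selection scores are
# updated from execution success/failure signals.
scores = {"web_search": 0.5, "calculator": 0.5, "sql_query": 0.5}

def update(tool: str, succeeded: bool, lr: float = 0.1) -> None:
    """Move the tool's score toward 1 on success, toward 0 on failure."""
    target = 1.0 if succeeded else 0.0
    scores[tool] += lr * (target - scores[tool])

def pick_tool() -> str:
    """Greedy selection; real systems would also explore."""
    return max(scores, key=scores.get)

update("sql_query", succeeded=False)   # execution signal from a failed call
update("calculator", succeeded=True)
print(pick_tool(), scores)
```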