Deep Learning Weekly: Issue 445
Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems, Nano Banana 2: Combining Pro capabilities with lightning-fast speed, and many more!
This week in deep learning, we bring you Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems, Nano Banana 2: Combining Pro capabilities with lightning-fast speed, and a paper on Beyond Language Modeling: An Exploration of Multimodal Pretraining.
You may also enjoy Gemini 3.1 Flash-Lite, Personalization features can make LLMs more agreeable, a paper on dLLM: Simple Diffusion Language Modeling, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemini 3.1 Flash-Lite: Built for intelligence at scale
Google launches Gemini 3.1 Flash-Lite in preview, positioning it as its fastest and most cost-efficient model yet at $0.25 per 1M input tokens, built specifically for high-volume developer workloads that demand both speed and reasoning.
GPT-5.3 Instant: Smoother, more useful everyday conversations
OpenAI releases GPT-5.3 Instant as the new default ChatGPT model, cutting hallucinations by up to 26.8% and dramatically reducing the over-cautious, “cringe” responses that frustrated everyday users.
Statement from Dario Amodei on our discussions with the Department of War
Anthropic’s Dario Amodei publicly refuses Department of War demands to remove AI safeguards on mass domestic surveillance and fully autonomous weapons.
Alibaba’s Qwen AI team loses its founding technical lead and two key researchers just 24 hours after shipping the Qwen3.5 small model series, raising alarm about the project’s open-source future and triggering a 5% drop in Alibaba’s stock.
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
Microsoft releases Phi-4-reasoning-vision-15B, a compact open-weight multimodal model that rivals much larger models on math, science, and computer-use tasks while requiring a fraction of the training compute.
Nano Banana 2: Combining Pro capabilities with lightning-fast speed
Google launches Nano Banana 2 (Gemini 3.1 Flash Image), combining the advanced quality of Nano Banana Pro with Flash-level speed, rolling out across Gemini, Search, Google Ads, Vertex AI, and Flow.
MLOps/LLMOps
Opik Claude Code Plugin: Automatically Configure Observability for Complex Agentic Systems
Announcing the new Opik Claude Code Plugin, which automatically instruments Python and JavaScript agent code with tracing, applies observability best practices, and logs what Claude Code is doing as it modifies a system.
Improve chatbot memory using Google Cloud
A practical guide about building scalable long-term memory for agentic chatbots using a three-tier polyglot storage architecture on Google Cloud (Redis, Bigtable, BigQuery).
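The hot/warm/cold split behind that architecture can be sketched as a read-through lookup: check the fastest tier first and fall back to slower, larger ones. The class and tier names below are illustrative stand-ins (plain dicts), not the actual Google Cloud client APIs.

```python
# Minimal sketch of a three-tier chatbot memory, assuming a Redis-like hot
# tier (session state), a Bigtable-like warm tier (per-user long-term facts),
# and a BigQuery-like cold tier (archived transcripts).

class TieredMemory:
    def __init__(self):
        self.hot = {}    # lowest latency: current session turns
        self.warm = {}   # per-user long-term memory
        self.cold = {}   # analytical archive, slowest to query

    def remember(self, user_id, key, value, tier="hot"):
        getattr(self, tier)[(user_id, key)] = value

    def recall(self, user_id, key):
        # Read-through: fastest tier first, then fall back.
        for tier in (self.hot, self.warm, self.cold):
            if (user_id, key) in tier:
                return tier[(user_id, key)]
        return None

mem = TieredMemory()
mem.remember("u1", "name", "Ada", tier="warm")
print(mem.recall("u1", "name"))  # Ada
```

A production version would add promotion between tiers (e.g. caching a warm-tier hit in the hot tier) and TTL-based eviction, which the guide's Redis layer handles natively.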
Learning
Personalization features can make LLMs more agreeable
MIT/Penn State research finds that LLM personalization features significantly amplify sycophantic behavior, with memory-stored user profiles producing the largest effect in 4 of the 5 models tested during real two-week user interactions.
The threat of AI-generated code to the world’s digital infrastructure
An article about how AI-enabled “vibe contributing” — low-quality, AI-generated code submitted by novice contributors — is overwhelming volunteer open source maintainers and threatening the stability of global digital infrastructure.
Teaching LLMs to reason like Bayesians
A research blog post about how Google trained LLMs to reason like optimal Bayesian agents via fine-tuning on Bayesian model outputs, dramatically improving probabilistic belief-updating across domains.
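The target behavior here is the textbook Bayesian update, posterior ∝ prior × likelihood, which an optimal agent applies after each piece of evidence. A minimal worked example over a discrete hypothesis set (the coin setup is my own illustration, not from the post):

```python
# Bayesian belief update over discrete hypotheses: multiply each prior by
# the likelihood of the observed evidence, then renormalize.

def bayes_update(prior, likelihood):
    """prior: {hypothesis: p}, likelihood: {hypothesis: P(evidence | h)}."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Two hypotheses about a coin: fair (P(heads)=0.5) vs. biased (P(heads)=0.8).
belief = {"fair": 0.5, "biased": 0.5}
for _ in range(3):  # observe three heads in a row
    belief = bayes_update(belief, {"fair": 0.5, "biased": 0.8})
print(round(belief["biased"], 3))  # 0.804
```

Measuring how far a model's stated probabilities drift from this exact computation is the natural way to score the belief-updating improvements the post reports.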
Mixture of Experts (MoEs) in Transformers
A technical blog post about how Hugging Face redesigned the transformers library to make Mixture-of-Experts (MoE) models first-class citizens, covering weight loading, expert routing backends, parallelism, and training optimizations.
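The routing mechanism at the heart of that redesign is small enough to sketch: a gate scores every expert per token, only the top-k are evaluated, and their outputs are mixed with renormalized weights. The scalar "experts" below are toy stand-ins, not the Hugging Face kernel backends the post describes.

```python
# Toy top-k Mixture-of-Experts routing: select the k highest-scoring experts,
# softmax their gate logits, and mix their outputs; unselected experts are
# never computed, which is where MoE saves FLOPs.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, gate_logits, experts, k=2):
    topk = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i])[-k:]
    weights = softmax([gate_logits[i] for i in topk])
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
y = moe_forward(3.0, gate_logits=[0.1, 2.0, -1.0, 1.0], experts=experts, k=2)
```

Real implementations batch this per token and add a load-balancing loss so the gate does not collapse onto a few experts; the blog post's routing backends optimize exactly this gather/scatter pattern.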
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards. https://github.com/comet-ml/opik
A minimal, secure Python interpreter written in Rust for use by AI.
Papers & Publications
dLLM: Simple Diffusion Language Modeling
Abstract:
Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures.
To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.
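The corruption process shared by masked DLMs such as LLaDA is simple to state: each token is independently replaced by a mask symbol with probability t (the diffusion timestep), and the model is trained to predict the originals at exactly the masked positions. A dependency-free sketch of that forward step, with an illustrative toy sentence (not dLLM's API):

```python
# Absorbing-state ("mask") forward corruption for a diffusion language model:
# mask each token with probability t, then collect the training targets.
import random

MASK = "[MASK]"

def corrupt(tokens, t, rng):
    """Mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
noisy = corrupt(tokens, t=0.5, rng=rng)
# Loss is cross-entropy only on the positions that were masked.
targets = [(i, tok) for i, (tok, n) in enumerate(zip(tokens, noisy)) if n == MASK]
```

Sampling t uniformly per sequence during training and iteratively unmasking at inference turns this single step into the full generative process, which is the pipeline dLLM standardizes.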
Beyond Language Modeling: An Exploration of Multimodal Pretraining
Abstract:
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.


