Deep Learning Weekly: Issue 439
FLUX.2 [klein], Heaps do lie: debugging a memory leak in vLLM, a paper on Toward Efficient Agents: Memory, Tool learning, and Planning, and many more!
This week in deep learning, we bring you FLUX.2 [klein], Heaps do lie: debugging a memory leak in vLLM, and a paper on Toward Efficient Agents: Memory, Tool learning, and Planning.
You may also enjoy Personal Intelligence: Connecting Gemini to Google apps, How We Built a Semantic Highlight Model To Save Token Cost for RAG, a paper on ShapeR: Robust Conditional 3D Shape Generation from Casual Captures, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
FLUX.2 [klein]: Towards Interactive Visual Intelligence | Black Forest Labs
Black Forest Labs launches FLUX.2 [klein], a unified image generation and editing model achieving sub-0.5s inference on consumer GPUs (13GB VRAM) while matching models 5x its size in quality.
Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes
A virtual hackathon that focuses on shipping LLM-powered apps that turn New Year’s resolutions into measurable outcomes across six impact categories.
Personal Intelligence: Connecting Gemini to Google apps
Google launches Personal Intelligence beta for Gemini, connecting Gmail, Photos, YouTube, and Search with one tap to enable contextual, personalized AI assistance.
Veo 3.1 Ingredients to Video: More consistency, creativity and control
Google announces Veo 3.1 “Ingredients to Video” update featuring native vertical video generation, improved character consistency, and state-of-the-art upscaling for mobile-first content creation.
Preply raises $150M to enhance human-led language learning with AI
Language learning marketplace Preply raises $150M Series D at $1.2B valuation to scale AI-enhanced human tutoring.
OpenAI quietly launches ChatGPT Translate with support for 25 languages
OpenAI quietly launches ChatGPT Translate as a free, standalone web prototype supporting 25 languages, targeting student learning, business documents, and travel use cases.
MLOps & LLMOps
Heaps do lie: debugging a memory leak in vLLM
An engineering deep-dive documenting Mistral AI’s investigation of a 400 MB/minute memory leak in vLLM during disaggregated serving, ultimately traced to UCX’s mmap hooking mechanism interfering with Python’s memory allocator.
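If you want to reproduce the first step of this kind of investigation yourself, here is a minimal sketch of watching resident memory over time to confirm steady growth. This is generic monitoring with psutil, not the tooling Mistral used; the interval and sample count are arbitrary.

```python
# Illustrative only: watch this process's RSS over time; steady growth
# at a roughly constant rate is the classic signature of a leak.
import time

import psutil


def watch_rss(interval_s: float = 60.0, samples: int = 10) -> None:
    proc = psutil.Process()  # the current process
    last = proc.memory_info().rss
    for _ in range(samples):
        time.sleep(interval_s)
        now = proc.memory_info().rss
        delta_mb = (now - last) / 1e6
        print(f"RSS: {now / 1e6:.1f} MB ({delta_mb:+.1f} MB over {interval_s:.0f}s)")
        last = now


if __name__ == "__main__":
    watch_rss(interval_s=5.0, samples=5)
```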
gRPC as a custom transport for MCP
A technical blog post explaining Google Cloud’s initiative to enable gRPC as a native transport for Model Context Protocol – eliminating transcoding overhead, enabling bidirectional streaming, and more.
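For a feel of what bidirectional streaming over gRPC looks like in Python, here is a toy pass-through server using grpcio's generic handlers. The service and method names are hypothetical and this is not Google Cloud's MCP transport design; it only shows the stream-in/stream-out shape that the post argues HTTP-based transports lack.

```python
# A toy bidirectional-streaming gRPC server (raw bytes in, raw bytes out).
# Service/method names are made up for illustration.
from concurrent import futures

import grpc


def echo_stream(request_iterator, context):
    # Bidirectional streaming: yield a response for each incoming message.
    for msg in request_iterator:
        yield msg


handler = grpc.method_handlers_generic_handler(
    "example.McpTransport",
    {"Session": grpc.stream_stream_rpc_method_handler(echo_stream)},
)

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
server.add_generic_rpc_handlers((handler,))
server.add_insecure_port("localhost:50051")
server.start()
server.wait_for_termination()
```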
A blog post arguing that files and filesystems are emerging as the core abstraction for agentic AI, with agents using ~5-10 tools (CLI, code interpreter, web fetch) operating on files proving more general than agents with 100+ MCP tools.
Learning
Token Optimization Strategies for AI Agents
A practical guide to reducing LLM token consumption in agentic systems by up to 75% through model selection, prompt caching, context optimization, and structured outputs.
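As a taste of one tactic in that family, here is a minimal sketch of trimming retrieved context to a fixed token budget with tiktoken. The budget, encoding name, and sample strings are illustrative, not figures from the guide.

```python
# Keep context chunks, in order, until a token budget is exhausted.
import tiktoken


def trim_to_budget(chunks: list[str], budget: int = 2000,
                   encoding_name: str = "cl100k_base") -> list[str]:
    enc = tiktoken.get_encoding(encoding_name)
    kept, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break  # stop once the next chunk would blow the budget
        kept.append(chunk)
        used += n
    return kept


print(trim_to_budget(["first retrieved chunk...", "second retrieved chunk..."], budget=50))
```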
LLM Context Pruning: Improving RAG and Agentic AI Systems
A technical guide explaining context pruning for RAG systems, introducing Provence as a lightweight cross-encoder that performs document-level reranking and sentence-level pruning simultaneously.
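For readers who want the gist in code: below is a rough sketch of sentence-level pruning with an off-the-shelf cross-encoder. It is in the spirit of what the guide describes, not the Provence implementation; the model name, naive sentence splitting, and threshold are all illustrative (the threshold depends on the model's score scale).

```python
# Score each sentence against the query with a cross-encoder and drop
# low-scoring sentences before passing the document to the LLM.
from sentence_transformers import CrossEncoder


def prune_sentences(query: str, document: str, threshold: float = 0.0) -> str:
    sentences = [s.strip() for s in document.split(".") if s.strip()]  # naive split
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, s) for s in sentences])
    kept = [s for s, score in zip(sentences, scores) if score >= threshold]
    return ". ".join(kept) + "."


print(prune_sentences(
    "What causes memory leaks?",
    "UCX hooks mmap. The weather was nice. Allocators can fragment over time.",
))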
How We Built a Semantic Highlight Model To Save Token Cost for RAG
A technical blog post detailing an open-source bilingual semantic highlight model that achieves 70-80% token cost reduction for RAG systems by identifying semantically relevant sentences.
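A simple baseline for the same idea (not the open-source model from the post) is to embed the query and each candidate sentence with a bi-encoder and keep only the top-scoring sentences; the model name and top_k below are illustrative.

```python
# Keep only the sentences most similar to the query before building the prompt.
from sentence_transformers import SentenceTransformer, util


def highlight(query: str, sentences: list[str], top_k: int = 3) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb = model.encode(query, convert_to_tensor=True)
    s_emb = model.encode(sentences, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, s_emb)[0]           # cosine similarity per sentence
    top = sims.topk(k=min(top_k, len(sentences)))  # indices of the best sentences
    return [sentences[i] for i in top.indices.tolist()]
```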
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
MemOS is a Memory Operating System for LLMs and AI agents that unifies store / retrieve / manage for long-term memory, enabling context-aware and personalized interactions with KB, multi-modal, tool memory, and enterprise-grade optimizations built in.
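To make the store / retrieve / manage pattern concrete, here is a toy sketch with hypothetical names; it is not MemOS's API, just the simplest possible shape of a long-term memory layer for an agent.

```python
# Toy long-term memory: store snippets, retrieve by keyword overlap, evict old items.
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    items: list[str] = field(default_factory=list)

    def store(self, text: str) -> None:
        self.items.append(text)

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        q = set(query.lower().split())
        scored = sorted(self.items, key=lambda t: -len(q & set(t.lower().split())))
        return scored[:top_k]

    def manage(self, max_items: int = 1000) -> None:
        # Naive eviction policy: keep only the most recent memories.
        self.items = self.items[-max_items:]


mem = MemoryStore()
mem.store("User prefers concise answers.")
mem.store("User's project serves models with vLLM.")
print(mem.retrieve("what serving stack does the user run?"))
```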
Papers & Publications
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Abstract:
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
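For context on the headline metric, here is the standard symmetric (squared) Chamfer distance between two point clouds in NumPy; this is the textbook form, not ShapeR's evaluation code.

```python
# Chamfer distance: average nearest-neighbor distance in both directions.
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (N, 3) and b: (M, 3) point clouds."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


a = np.random.rand(100, 3)
b = np.random.rand(120, 3)
print(chamfer_distance(a, b))
```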
Toward Efficient Agents: Memory, Tool learning, and Planning
Abstract:
Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc. Aimed at conducting comprehensive research addressing the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.
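As a small illustration of the effectiveness-vs-cost framing the survey uses, the sketch below computes the Pareto frontier over hypothetical (cost, effectiveness) points for different agent configurations; the data is made up for illustration.

```python
# Keep configurations not dominated by any other (lower cost AND higher effectiveness).
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """points: (cost, effectiveness); lower cost and higher effectiveness are better."""
    frontier = []
    for c, e in points:
        dominated = any(
            c2 <= c and e2 >= e and (c2, e2) != (c, e) for c2, e2 in points
        )
        if not dominated:
            frontier.append((c, e))
    return sorted(frontier)


configs = [(1.0, 0.62), (2.5, 0.71), (2.0, 0.65), (4.0, 0.73), (3.0, 0.70)]
print(pareto_frontier(configs))
# [(1.0, 0.62), (2.0, 0.65), (2.5, 0.71), (4.0, 0.73)] -- (3.0, 0.70) is dominated
```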



We build faster models, sharper agents, and quieter intelligence.
Yet the real question is not how efficiently machines think,
but whether we remember why we asked them to think for us.
Progress accelerates; meaning must keep up.