Deep Learning Weekly: Issue 446
Native Observability & Alerts for Your OpenClaw with Opik, Gemini Embedding 2, a paper on LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory, and many more!
This week in deep learning, we bring you Native Observability & Alerts for Your OpenClaw with Opik, Gemini Embedding 2, and a paper on LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory.
You may also enjoy GPT-5.4, a paper on DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemini Embedding 2: Our first natively multimodal embedding model
Google launches Gemini Embedding 2, its first natively multimodal embedding model unifying text, images, video, audio, and documents into a single semantic space.
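The practical upside of a single semantic space is that any modality can be scored against any other with the same similarity function. A minimal sketch of that idea, using a stand-in embed() helper rather than the actual Gemini Embedding 2 API (the corpus strings and vector size are illustrative):

```python
import hashlib
import numpy as np

def embed(item: str) -> np.ndarray:
    # Stand-in for a multimodal embedding call (text, image caption, doc chunk, ...):
    # a deterministic pseudo-random unit vector so the sketch runs end to end.
    seed = int(hashlib.sha256(item.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=256)
    return v / np.linalg.norm(v)

def search(query: str, corpus: list[str], k: int = 3) -> list[tuple[str, float]]:
    # One space, one scoring rule: a text query can rank images, audio clips,
    # or document chunks with the same dot product.
    q = embed(query)
    scored = sorted(((float(q @ embed(c)), c) for c in corpus), reverse=True)
    return [(c, s) for s, c in scored[:k]]

print(search("diagram of a transformer", ["photo: cat", "slide: attention heads", "audio: podcast intro"]))
```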
OpenAI is acquiring Promptfoo — an AI security platform used by 25%+ of Fortune 500 companies — to embed red-teaming, jailbreak detection, and agentic risk evaluation natively into its enterprise Frontier platform.
OpenAI launches GPT-5.4 with a 1M-token context, new Tool Search API, and record scores on coding and knowledge-work benchmarks — its most capable frontier model for professional and agentic use.
Google lets Gemini generate fully-formed Docs, Sheets, and Slides by pulling from Gmail, Drive, and Chat — turning Workspace into a single-prompt content creation engine.
Yann LeCun’s AMI Labs raises $1.03B to build world models | TechCrunch
Yann LeCun’s AMI Labs raises $1.03B at a $3.5B valuation to build JEPA-based world models — AI that learns from reality rather than language — with NVIDIA, Samsung, and Eric Schmidt among backers.
MLOps/LLMOps
Native Observability & Alerts for Your OpenClaw with Opik
A blog post announcing opik-openclaw, a native OpenClaw plugin from Comet that adds full-stack observability — tracing every LLM call, tool execution, token cost, and sub-agent delegation — to address the visibility gap in autonomous agent workflows.
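The post's plugin internals aren't reproduced here, but the shape of the instrumentation is roughly what Opik's Python SDK already exposes: decorated functions become spans nested under a trace. A hedged sketch (the OpenClaw-specific wiring is assumed, not shown in the source):

```python
from opik import track  # Opik's Python SDK; the opik-openclaw plugin applies this automatically

@track(name="web_search")
def web_search(query: str) -> str:
    # Each tracked call is logged as a span with inputs, outputs, and latency.
    return f"results for {query!r}"

@track(name="research_agent")
def research_agent(task: str) -> str:
    # Calls made inside a tracked function become child spans, which is how
    # tool executions and sub-agent delegations end up as a trace tree.
    evidence = web_search(task)
    return f"summary of: {evidence}"

research_agent("compare multimodal embedding models")
```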
Learning
Improving instruction hierarchy in frontier LLMs
A technical research post about OpenAI’s IH-Challenge — an RL training dataset that teaches models a strict trust hierarchy (System > Developer > User > Tool) to resist prompt injection, jailbreaks, and instruction conflicts.
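The dataset itself isn't shown in the post; as a toy illustration of the hierarchy it trains for, conflicting instructions can be thought of as resolving by source rank, with tool output never overriding the system or developer prompt (the keys and values below are made up):

```python
RANK = {"system": 0, "developer": 1, "user": 2, "tool": 3}  # lower = more trusted

def effective_policy(instructions: list[tuple[str, str, str]]) -> dict[str, str]:
    """instructions: (source, key, value) triples, e.g. ("user", "language", "French").
    For each key, the most trusted source that set it wins."""
    policy: dict[str, tuple[int, str]] = {}
    for source, key, value in instructions:
        rank = RANK[source]
        if key not in policy or rank < policy[key][0]:
            policy[key] = (rank, value)
    return {k: v for k, (_, v) in policy.items()}

print(effective_policy([
    ("system", "reveal_hidden_prompt", "never"),
    ("tool", "reveal_hidden_prompt", "always"),  # injected via a web page; loses to system
    ("user", "language", "French"),              # no higher-ranked source objects; wins
]))
```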
Code Concepts: A Large-Scale Synthetic Dataset Generated from Programming Concept Seeds
A technical blog post about NVIDIA’s concept-driven synthetic data pipeline that generated 15M Python programming problems, yielding a 6-point HumanEval gain (73→79) when included in Nemotron-Nano-v3 pretraining.
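The pipeline's code isn't public in the post, but the concept-seed idea is easy to sketch: sample a small combination of concepts and turn it into a generation prompt for an LLM to expand into a full problem, solution, and tests. The concept list and prompt wording below are illustrative, not NVIDIA's:

```python
import random

# Hypothetical concept seeds; the actual taxonomy behind the 15M problems is far larger.
CONCEPTS = ["string parsing", "dynamic programming", "hash maps",
            "binary search", "graph traversal", "datetime arithmetic"]

def make_problem_prompt(rng: random.Random) -> str:
    # Combine a few concepts into one instruction, then hand it to an LLM
    # to write the full problem statement, reference solution, and unit tests.
    combo = rng.sample(CONCEPTS, k=2)
    return (f"Write a self-contained Python problem that requires "
            f"{combo[0]} and {combo[1]}, plus a reference solution and tests.")

rng = random.Random(0)
prompts = [make_problem_prompt(rng) for _ in range(3)]
print(prompts)
```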
Practical Guide to Evaluating and Testing Agent Skills
A practical guide about building lightweight eval harnesses for agent skills, walking through how to define success criteria, construct prompt sets, and iterate — illustrated by taking a Gemini Interactions API skill from 66.7% to 100% pass rate.
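A lightweight harness of the kind the guide describes can fit in a few dozen lines: each case pairs a prompt with a programmatic success criterion, and the pass rate is the number you iterate on. A minimal sketch (the cases and the echo-style skill are made up for illustration):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]   # success criterion for this case

def run_eval(skill: Callable[[str], str], cases: list[Case]) -> float:
    # Run the skill over every prompt, score each output, report the pass rate.
    passed = sum(1 for c in cases if c.check(skill(c.prompt)))
    return passed / len(cases)

cases = [
    Case("Schedule a 30-minute meeting tomorrow", lambda out: "30-minute" in out),
    Case("Cancel my 9am standup", lambda out: "cancel" in out.lower()),
]
echo_skill = lambda prompt: f"Confirmed: {prompt}"
print(run_eval(echo_skill, cases))   # 1.0 here; a real skill starts lower and improves
```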
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
A Stanford benchmark revealing that frontier models (GPT-5.2, Gemini-3 Pro, Claude 4.5 Sonnet) all fail to build accurate, revisable cognitive maps during active spatial exploration — humans consistently outperform all of them.
Libraries & Code
Opik: an open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
JavaScript in-page GUI agent. Control web interfaces with natural language.
Papers & Publications
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
Abstract:
Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism that preserves uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames and to generalize to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods (reducing ATE on KITTI by over 74%) and achieves robust, globally consistent reconstruction over unprecedented horizons.
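As a rough, runnable caricature of the loop the abstract describes (not the authors' code): a small state updated at test time stands in for the parametric TTT memory that anchors global scale, and a sliding window of recent frames stands in for the uncompressed SWA context.

```python
import numpy as np

def dummy_chunk_model(chunk, context, scale):
    # Stand-in for bidirectional intra-chunk reconstruction conditioned on
    # the sliding-window context and the current global scale.
    return [scale * np.ones(3) for _ in chunk]

def reconstruct(frames, chunk_size=8, window_size=4):
    scale = 1.0          # toy "parametric" memory: a global scale estimate
    window = []          # toy non-parametric memory: recent uncompressed frames
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        points = dummy_chunk_model(chunk, context=window, scale=scale)
        # Test-time update of the parametric memory keeps long runs from drifting.
        scale = 0.9 * scale + 0.1 * float(np.mean([np.linalg.norm(p) for p in points]))
        window = (window + chunk)[-window_size:]
        outputs.extend(points)
    return outputs

reconstruct([np.zeros(3)] * 32)
```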
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
Abstract:
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
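The "verifiable constraints" framing is the interesting part: a plan is checked programmatically against local and global budgets rather than judged step by step. A toy illustration (not the benchmark's code; the step schema and budgets are invented):

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    cost: float      # dollars
    hours: float

def verify(plan: list[Step], budget: float, total_hours: float) -> dict[str, bool]:
    # Global constraints are what demand genuine planning: each step can look
    # fine locally while the whole itinerary blows the budget.
    return {
        "within_budget": sum(s.cost for s in plan) <= budget,
        "within_time": sum(s.hours for s in plan) <= total_hours,
        "nonempty": len(plan) > 0,
    }

plan = [Step("flight", 420.0, 6.0), Step("hotel", 300.0, 0.0), Step("museum", 25.0, 3.0)]
print(verify(plan, budget=800.0, total_hours=12.0))
```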

