Deep Learning Weekly: Issue 448
Cursor's Composer 2, TurboQuant: Redefining AI efficiency with extreme compression, a paper on Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation, and many more!
This week in deep learning, we bring you Cursor’s Composer 2, TurboQuant: Redefining AI efficiency with extreme compression and a paper on Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation.
You may also enjoy What 81,000 people want from AI \ Anthropic, Evaluating agentic search in OpenSearch, a paper on OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing Composer 2 · Cursor
Cursor launches Composer 2, a frontier-level coding model trained via continued pretraining and long-horizon RL that scores 61.3 on CursorBench and 73.7 on SWE-bench Multilingual.
What 81,000 people want from AI \ Anthropic
Anthropic’s largest-ever qualitative study — 80,508 Claude users across 159 countries and 70 languages — reveals what people want from AI, what they’ve already gotten, and what they fear.
Lyria 3 Pro: Create longer tracks in more Google products
Google launches Lyria 3 Pro, an upgraded music generation model that produces tracks up to 3 minutes with structural song awareness (intros, verses, choruses, bridges).
MolmoWeb: An open agent for automating web tasks
Allen AI releases MolmoWeb, a fully open visual web agent built on Molmo 2 that scores 78.2% on WebVoyager, outperforming GPT-4o-based agents while releasing all weights, training data, and evaluation tools.
OpenAI acquires Astral
OpenAI acquires Astral — maker of the Python developer tools uv, Ruff, and ty, used by millions of developers — to deepen its Codex ecosystem.
MLOps/LLMOps
Building an MCP Ecosystem at Pinterest
Pinterest Engineering details how they scaled MCP from concept to a production ecosystem of domain-specific servers — Presto, Spark, Knowledge — with a central registry, two-layer auth, and 66,000 monthly invocations saving an estimated 7,000 engineer-hours per month.
Run cloud agents in your own infrastructure
Cursor launches self-hosted cloud agents GA, keeping code and tool execution entirely within enterprise infrastructure while Cursor handles orchestration and inference.
Learning
TurboQuant: Redefining AI efficiency with extreme compression
Google Research releases TurboQuant, a KV cache quantization method that achieves 6x+ memory reduction to 3 bits with zero accuracy loss and 8x attention speedup on H100s.
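As a rough, shape-level illustration of the kind of low-bit KV-cache quantization the post discusses (this is a generic per-channel asymmetric round-trip, not TurboQuant's actual algorithm), a 3-bit quantize/dequantize pass might look like:

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 3):
    """Per-channel asymmetric quantization of a KV-cache tensor.

    x: (seq_len, num_heads, head_dim) activations.
    Returns integer codes plus per-channel scale and zero-point.
    """
    levels = 2**bits - 1
    lo = x.min(axis=0, keepdims=True)            # per-channel min
    hi = x.max(axis=0, keepdims=True)            # per-channel max
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 8, 64)).astype(np.float32)
codes, scale, lo = quantize_kv(kv, bits=3)
recon = dequantize_kv(codes, scale, lo)
# Rounding error is bounded by half a quantization step per channel.
assert np.all(np.abs(recon - kv) <= scale / 2 + 1e-6)
```

Storing 3-bit codes (once bit-packed) in place of fp16 keys and values is where the memory reduction comes from; real methods add tricks on top of this baseline to keep accuracy intact.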
Evaluating agentic search in OpenSearch
A technical deep-dive on how OpenSearch benchmarked its agentic search feature across search relevance (BEIR and BRIGHT datasets) and query execution accuracy (Spider dataset), powered by Claude Opus 4.6.
Scaling Karpathy’s Autoresearch: What Happens When the Agent Gets a GPU Cluster
A technical blog post on how SkyPilot scaled Karpathy’s autoresearch agent from 1 to 16 GPUs, enabling ~910 experiments in 8 hours.
How Anthropic’s Claude Thinks - ByteByteGo Newsletter
ByteByteGo breaks down Anthropic’s interpretability research into six concrete findings about how Claude actually thinks — from parallel math strategies to ahead-of-time poetry planning to a default-refusal circuit that misfires into hallucinations.
A Visual Guide to Attention Variants in Modern LLMs
A visual reference guide mapping seven attention variants — MHA, GQA, MLA, SWA, DeepSeek Sparse Attention, Gated Attention, and hybrid architectures — across the open-weight models currently using them in production.
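As one concrete example from that list, grouped-query attention (GQA) shares each K/V head across a group of query heads, shrinking the KV cache. A minimal shape-level sketch (illustrative only, not any specific model's implementation):

```python
import numpy as np

def gqa_attention(q, k, v, num_kv_heads):
    """Grouped-query attention: q has more heads than k/v; each KV head
    serves num_q_heads // num_kv_heads query heads."""
    seq, num_q_heads, d = q.shape
    group = num_q_heads // num_kv_heads
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=1)              # (seq, num_q_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax over keys
    return np.einsum("hqk,khd->qhd", weights, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8, 32))   # 8 query heads
k = rng.normal(size=(16, 2, 32))   # only 2 KV heads -> 4x smaller KV cache
v = rng.normal(size=(16, 2, 32))
out = gqa_attention(q, k, v, num_kv_heads=2)
assert out.shape == (16, 8, 32)
```

With num_kv_heads equal to the number of query heads this reduces to standard MHA; with a single KV head it becomes multi-query attention, so GQA interpolates between the two.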
Fast regex search: indexing text for agent tools
A technical deep-dive on how Cursor built a local sparse n-gram index to replace ripgrep for agent search — eliminating 15+ second grep latency in large monorepos by narrowing regex matches to a pre-filtered candidate set before full scanning.
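A toy version of the pre-filter idea (hypothetical helper names; Cursor's actual index is far more sophisticated): build a trigram inverted index, take a literal substring the regex must contain, intersect posting lists to get candidate files, and only then run the full regex:

```python
import re
from collections import defaultdict

def trigrams(s: str):
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Inverted index from trigram -> file ids, used to pre-filter regex search."""

    def __init__(self):
        self.index = defaultdict(set)
        self.files = {}

    def add(self, file_id: str, text: str):
        self.files[file_id] = text
        for g in trigrams(text):
            self.index[g].add(file_id)

    def search(self, pattern: str, literal_hint: str):
        # literal_hint: a literal substring every match must contain
        # (real systems derive these hints from the regex AST).
        candidates = set(self.files)
        for g in trigrams(literal_hint):
            candidates &= self.index.get(g, set())
        rx = re.compile(pattern)
        return sorted(f for f in candidates if rx.search(self.files[f]))

idx = TrigramIndex()
idx.add("a.py", "def load_config(path): ...")
idx.add("b.py", "def save_state(path): ...")
print(idx.search(r"load_\w+", literal_hint="load_"))  # -> ['a.py']
```

The candidate-set intersection is what kills the latency: instead of scanning every file in a monorepo, the regex engine only touches files whose trigram sets could possibly match.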
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai/teen-safety-policy-pack
A set of prompt-based safety policies designed to create age-appropriate protections for teens.
Papers & Publications
OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
Abstract:
Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives (search, open, and find) over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy.
Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
Abstract:
Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-k similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity-semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader’s uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.


