Deep Learning Weekly: Issue 441
Qwen3-Coder-Next, Inside OpenAI’s in-house data agent, a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text, and many more!
This week in deep learning, we bring you Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding, Inside OpenAI’s in-house data agent and a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text.
You may also enjoy Project Genie: Experimenting with infinite, interactive worlds, Towards a science of scaling agent systems: When and why agent systems work, a paper on PaperBanana: Automating Academic Illustration for AI Scientists, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding
Official blog post announcing Qwen3-Coder-Next, an 80B-parameter coding model achieving competitive performance on SWE-Bench (70.6% on Verified) while enabling 10x higher throughput for repository-level agentic workflows.
Project Genie: Experimenting with infinite, interactive worlds
Google launches Project Genie, an experimental world model powered by Genie 3 that lets Google AI Ultra subscribers create and explore infinite, interactive environments in real-time using text and image prompts.
A news article reporting Vercel’s complete rebuild of v0 to address the “90% problem” where AI-generated code fails to integrate with existing production infrastructure.
Voxtral transcribes at the speed of sound
A product announcement for Mistral’s Voxtral Transcribe 2, featuring state-of-the-art speech-to-text with speaker diarization at $0.003/min and Voxtral Realtime with sub-200ms latency for live transcription.
MLOps & LLMOps
Inside OpenAI’s in-house data agent
OpenAI’s internal data agent powered by GPT-5.2 enables natural language queries across 600+ petabytes and 70,000 datasets, using multi-layered context and self-correction to deliver trustworthy analytics in minutes.
A blog post arguing AI memory requires active maintenance infrastructure with six core functions to prevent accumulated noise from degrading agent performance over time.
The Agent Client Protocol Overview
A technical overview of the Agent Client Protocol (ACP), an open JSON-RPC 2.0 standard that provides a common interface for editors to interact with AI coding agents.
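Since ACP is built on JSON-RPC 2.0, every editor-to-agent message is a plain JSON envelope, typically exchanged over stdio. Here is a minimal sketch of constructing such a request in Python; the method and parameter names below are illustrative examples, not a normative part of the spec.

```python
import json

def jsonrpc_request(method: str, params: dict, req_id: int) -> str:
    """Build a JSON-RPC 2.0 request envelope, the wire format ACP uses."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": method,
        "params": params,
    })

# Hypothetical editor-to-agent prompt; session id and prompt text are made up.
msg = jsonrpc_request(
    "session/prompt",
    {"sessionId": "abc123", "prompt": "Fix the failing test"},
    1,
)
```

Because the envelope is standard JSON-RPC, any editor that can speak it can drive any ACP-compatible agent without bespoke integrations.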
Learning
Towards a science of scaling agent systems: When and why agent systems work
A research article presenting Google’s evaluation of 180 agent configurations, revealing multi-agent systems boost parallelizable tasks by 81% but degrade sequential tasks by 70%.
Moltbook: After The First Weekend - by Scott Alexander
Scott Alexander examines whether Moltbook AI activity is “real” or “roleplay” by evaluating external causes and effects.
The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?
A research article from Anthropic finding AI failures increasingly stem from incoherence rather than systematic misalignment as tasks grow harder, suggesting future risks resemble industrial accidents more than coherent goal pursuit.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A collection of skills for the Claude Code CLI, covering full-stack development with Cloudflare, React, Tailwind v4, and AI integrations.
Papers & Publications
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
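The core recipe — mask a key reasoning step and surround the true continuation with distractors — can be sketched in a few lines. This toy version takes pre-split steps and pre-written distractors as inputs; in the paper, an LLM performs both the masking and the distractor generation.

```python
import random

def make_mcq_fitm(steps: list[str], masked_idx: int,
                  distractors: list[str], seed: int = 0) -> dict:
    """Turn a reasoning passage into a multiple-choice fill-in-the-middle task.

    steps: the passage split into reasoning steps.
    masked_idx: which step to hide (the gold answer).
    distractors: plausible-but-wrong alternatives for the masked step.
    """
    rng = random.Random(seed)
    answer = steps[masked_idx]
    context = steps[:masked_idx] + ["[MASK]"] + steps[masked_idx + 1:]
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "context": " ".join(context),
        "options": options,
        # Exact option match gives the verifiable reward for RLVR.
        "answer_idx": options.index(answer),
    }
```

The point of the construction is that correctness becomes checkable by string equality, so any reasoning-rich text yields a verifiable RL task.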
PaperBanana: Automating Academic Illustration for AI Scientists
Abstract:
Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Abstract:
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.
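The claimed savings follow from simple patch arithmetic: a ViT-style encoder spends tokens proportional to image area, not code length. A back-of-envelope sketch, where the 28-pixel patch size and 448x448 canvas are illustrative assumptions rather than values from the paper:

```python
def image_token_cost(width: int, height: int, patch: int = 28) -> int:
    """Rough vision-token count for a rendered code screenshot:
    ViT-style encoders tile the image into fixed-size square patches."""
    return (width // patch) * (height // patch)

def compression_ratio(n_text_tokens: int, width: int, height: int,
                      patch: int = 28) -> float:
    """How many text tokens each vision token replaces at this resolution."""
    return n_text_tokens / image_token_cost(width, height, patch)

# A ~2000-token source file rendered onto a 448x448 canvas:
# 16 x 16 = 256 patches, so each vision token stands in for ~8 text tokens.
ratio = compression_ratio(2000, 448, 448)
```

Shrinking the canvas raises the ratio at the cost of legibility, which is exactly the trade-off the paper's compression experiments probe.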


