Deep Learning Weekly: Issue 441
Qwen3-Coder-Next, Inside OpenAI’s in-house data agent, a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text, and many more!
This week in deep learning, we bring you Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding, Inside OpenAI’s in-house data agent and a paper on Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text.
You may also enjoy Project Genie: Experimenting with infinite, interactive worlds, Towards a science of scaling agent systems: When and why agent systems work, a paper on PaperBanana: Automating Academic Illustration for AI Scientists, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding
Official blog post announcing Qwen3-Coder-Next, an 80B-parameter coding model achieving competitive performance on SWE-Bench (70.6% on Verified) while enabling 10x higher throughput for repository-level agentic workflows.
Project Genie: Experimenting with infinite, interactive worlds
Google launches Project Genie, an experimental world model powered by Genie 3 that lets Google AI Ultra subscribers create and explore infinite, interactive environments in real-time using text and image prompts.
A news article reporting Vercel’s complete rebuild of v0 to address the “90% problem” where AI-generated code fails to integrate with existing production infrastructure.
Voxtral transcribes at the speed of sound
A product announcement for Mistral’s Voxtral Transcribe 2, featuring state-of-the-art speech-to-text with speaker diarization at $0.003/min and Voxtral Realtime with sub-200ms latency for live transcription.
MLOps & LLMOps
Inside OpenAI’s in-house data agent
OpenAI’s internal data agent powered by GPT-5.2 enables natural language queries across 600+ petabytes and 70,000 datasets, using multi-layered context and self-correction to deliver trustworthy analytics in minutes.
A blog post arguing AI memory requires active maintenance infrastructure with six core functions to prevent accumulated noise from degrading agent performance over time.
The Agent Client Protocol Overview
A technical overview of the Agent Client Protocol (ACP), an open JSON-RPC 2.0 standard that provides a common interface for editors to interact with AI coding agents.
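Since ACP is built on JSON-RPC 2.0, every editor-to-agent message is a plain JSON envelope, typically exchanged over stdio. Here is a minimal sketch of constructing such a request in Python; the method and parameter names below are illustrative examples, not a normative part of the spec.

```python
import json

def jsonrpc_request(method: str, params: dict, req_id: int) -> str:
    """Build a JSON-RPC 2.0 request envelope, the wire format ACP uses."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": method,
        "params": params,
    })

# Hypothetical editor-to-agent prompt; session id and prompt text are made up.
msg = jsonrpc_request(
    "session/prompt",
    {"sessionId": "abc123", "prompt": "Fix the failing test"},
    1,
)
```

Because the envelope is standard JSON-RPC, any editor that can speak it can drive any ACP-compatible agent without bespoke integrations.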
Learning
Towards a science of scaling agent systems: When and why agent systems work
A research article presenting Google’s evaluation of 180 agent configurations, revealing multi-agent systems boost parallelizable tasks by 81% but degrade sequential tasks by 70%.
Moltbook: After The First Weekend - by Scott Alexander
Scott Alexander examines whether Moltbook AI activity is “real” or “roleplay” by evaluating external causes and effects.
The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?
A research article from Anthropic finding AI failures increasingly stem from incoherence rather than systematic misalignment as tasks grow harder, suggesting future risks resemble industrial accidents more than coherent goal pursuit.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A collection of skills for the Claude Code CLI, covering full-stack development with Cloudflare, React, Tailwind v4, and AI integrations.
Papers & Publications
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
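The core recipe — mask a key reasoning step and surround the true continuation with distractors — can be sketched in a few lines. This toy version takes pre-split steps and pre-written distractors as inputs; in the paper, an LLM performs both the masking and the distractor generation.

```python
import random

def make_mcq_fitm(steps: list[str], masked_idx: int,
                  distractors: list[str], seed: int = 0) -> dict:
    """Turn a reasoning passage into a multiple-choice fill-in-the-middle task.

    steps: the passage split into reasoning steps.
    masked_idx: which step to hide (the gold answer).
    distractors: plausible-but-wrong alternatives for the masked step.
    """
    rng = random.Random(seed)
    answer = steps[masked_idx]
    context = steps[:masked_idx] + ["[MASK]"] + steps[masked_idx + 1:]
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "context": " ".join(context),
        "options": options,
        # Exact option match gives the verifiable reward for RLVR.
        "answer_idx": options.index(answer),
    }
```

The point of the construction is that correctness becomes checkable by string equality, so any reasoning-rich text yields a verifiable RL task.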
PaperBanana: Automating Academic Illustration for AI Scientists
Abstract:
Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Abstract:
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.
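The claimed savings follow from simple patch arithmetic: a ViT-style encoder spends tokens proportional to image area, not code length. A back-of-envelope sketch, where the 28-pixel patch size and 448x448 canvas are illustrative assumptions rather than values from the paper:

```python
def image_token_cost(width: int, height: int, patch: int = 28) -> int:
    """Rough vision-token count for a rendered code screenshot:
    ViT-style encoders tile the image into fixed-size square patches."""
    return (width // patch) * (height // patch)

def compression_ratio(n_text_tokens: int, width: int, height: int,
                      patch: int = 28) -> float:
    """How many text tokens each vision token replaces at this resolution."""
    return n_text_tokens / image_token_cost(width, height, patch)

# A ~2000-token source file rendered onto a 448x448 canvas:
# 16 x 16 = 256 patches, so each vision token stands in for ~8 text tokens.
ratio = compression_ratio(2000, 448, 448)
```

Shrinking the canvas raises the ratio at the cost of legibility, which is exactly the trade-off the paper's compression experiments probe.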


