Deep Learning Weekly: Issue 437
Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes; 2025: The year in LLMs by Simon Willison; and a paper on NitroGen: An Open Foundation Model for Generalist Gaming Agents
This week in deep learning, we bring you Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes, 2025: The year in LLMs by Simon Willison, and a paper on NitroGen: An Open Foundation Model for Generalist Gaming Agents.
You may also enjoy TII’s Falcon H1R 7B can out-reason models up to 7x its size, The importance of Agent Harness in 2026, a paper on State of AI | An Empirical 100 Trillion Token Study with OpenRouter, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Comet, Vercel, and Google DeepMind launch a month-long AI Agents hackathon with $30K prizes
Kicking off Jan 13, the virtual hackathon focuses on shipping LLM-powered apps that turn New Year’s resolutions into measurable outcomes across six impact categories.
Nvidia’s Cosmos Reason 2 aims to bring reasoning VLMs into the physical world
NVIDIA released Cosmos Reason 2, an open-source reasoning vision-language model that enables robots and AI agents to understand and navigate the physical world.
TII’s Falcon H1R 7B can out-reason models up to 7x its size
Technology Innovation Institute launched Falcon H1R 7B, a reasoning model using hybrid Transformer-Mamba architecture that matches or outperforms competitors 2-7x its size through architectural efficiency and test-time scaling.
MLOps & LLMOps
Multi-Agent Systems: The Architecture Shift from Monolithic LLMs to Collaborative Intelligence
An architecture guide explaining the evolution from monolithic LLM prompts to multi-agent systems, covering architectural philosophies, cognitive patterns, and production challenges.
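To make the contrast with a single monolithic prompt concrete, here is a toy sketch of the planner/specialist/critic pattern such guides describe. The agent roles and the `llm(prompt)` callable are hypothetical stand-ins, not the guide's actual code:

```python
def multi_agent(llm, task):
    """Caricature of a multi-agent pipeline: a planner decomposes the task,
    specialists each handle one subtask, and a critic merges and reviews
    the drafts before anything is returned."""
    plan = llm(f"Break this task into subtasks, one per line:\n{task}")
    drafts = [llm(f"You are a specialist. Complete this subtask:\n{sub}")
              for sub in plan.splitlines() if sub.strip()]
    return llm("Review and merge these drafts:\n" + "\n---\n".join(drafts))
```

The point of the architecture is that each call gets a narrow, focused context instead of one prompt carrying every responsibility at once.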
The importance of Agent Harness in 2026
A technical blog post arguing that Agent Harnesses—the infrastructure layer managing long-running AI tasks—will become critical in 2026 as model differentiation shifts from benchmark performance to durability over hundreds of tool calls.
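As a rough illustration of what a harness adds on top of the model, here is a minimal sketch of a loop that checkpoints state after every tool call so a long-running task survives crashes and restarts. The checkpoint scheme, `Harness` class, and step functions are hypothetical, not from the post:

```python
import json
import os
import tempfile

class Harness:
    """Toy agent harness: retries each step and persists progress to disk,
    so a restarted process resumes instead of redoing hundreds of calls."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.state = {"step": 0, "results": []}
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                self.state = json.load(f)  # resume where we left off

    def run(self, steps, max_retries=3):
        while self.state["step"] < len(steps):
            step = steps[self.state["step"]]
            for attempt in range(max_retries):
                try:
                    result = step()  # one tool call or model call
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # give up after max_retries failures
            self.state["results"].append(result)
            self.state["step"] += 1
            self._save()  # checkpoint after every completed step
        return self.state["results"]

    def _save(self):
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.state, f)
```

Durability here comes from the bookkeeping around the calls, not from the model itself, which is the post's argument in miniature.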
Learning
2025: The year in LLMs by Simon Willison
A comprehensive year-in-review blog post covering 24 major LLM trends in 2025, including reasoning models’ emergence, coding agents’ $1B revenue milestone, Chinese models dominating open-weight rankings, and more.
The creator of Claude Code just revealed his workflow, and developers are losing their minds
An article covering Claude Code creator’s development workflow, demonstrating how parallel AI agents, verification loops, and more enable a single developer to achieve output comparable to an entire engineering team.
A technical blog post introducing NVIDIA’s Llama Nemotron VL 1B models, a 1.7B-parameter multimodal embedding and reranking system.
Towards Generalizable and Efficient Large-Scale Generative Recommenders
An article detailing Netflix’s approach to scaling generative recommendation models from 50M to 1B parameters, achieving substantial improvements through novel scaling laws, efficiency optimizations, and alignment strategies that address the unique challenges of recommendation systems.
The State Of LLMs 2025: Progress, Problems, and Predictions by Sebastian Raschka
A comprehensive technical review analyzing 2025 LLM developments through the lens of training methodologies, architectural evolution, and practical applications.
Why Stochastic Rounding is Essential for Modern Generative AI
A technical blog post explaining how stochastic rounding solves vanishing gradient problems in low-precision AI training, enabling models to train effectively in FP8 and 4-bit formats.
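The idea is easy to see in a toy sketch (plain Python, not the post's actual FP8/4-bit kernels, where the same trick is applied inside the cast):

```python
import math
import random

def stochastic_round(x, step=1.0, rng=random):
    """Unbiased rounding to a multiple of `step`: round down, then round up
    with probability equal to the fractional remainder, so the expected
    value equals x. Deterministic rounding would discard any update
    smaller than step/2 every single time, which is how small gradients
    vanish in low-precision formats."""
    scaled = x / step
    lo = math.floor(scaled)
    frac = scaled - lo
    if rng.random() < frac:
        lo += 1
    return lo * step

# A gradient of 0.3 applied with a coarse step of 1.0: each individual
# update lands on 0.0 or 1.0, but the average preserves the signal.
rng = random.Random(0)
samples = [stochastic_round(0.3, step=1.0, rng=rng) for _ in range(10_000)]
print(sum(samples) / len(samples))  # close to 0.3 in expectation
```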
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
All-in-one AI framework for semantic search, LLM orchestration, and language model workflows
Papers & Publications
State of AI | An Empirical 100 Trillion Token Study with OpenRouter
Abstract:
The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). With the release of the first widely adopted reasoning model, o1, on December 5th, 2024, the field shifted from single-pass pattern generation to multi-step deliberation inference, accelerating deployment, experimentation, and new classes of applications. As this shift unfolded at a rapid pace, our empirical understanding of how these models have actually been used in practice has lagged behind. In this work, we leverage the OpenRouter platform, which is an AI inference provider across a wide variety of LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time. In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of creative roleplay (beyond just the productivity tasks many assume dominate) and coding assistance categories, plus the rise of agentic inference. Furthermore, our retention analysis identifies foundational cohorts: early users whose engagement persists far longer than later cohorts. We term this phenomenon the Cinderella “Glass Slipper” effect. These findings underscore that the way developers and end-users engage with LLMs “in the wild” is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Abstract:
We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.
Recursive Language Models
Abstract:
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.
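The core loop can be caricatured in a few lines. This is a toy sketch assuming a hypothetical `llm(prompt)` callable and a fixed binary split; the paper's actual strategy lets the model programmatically decide how to examine and decompose the prompt:

```python
def recursive_answer(llm, question, text, budget=4000):
    """Answer `question` over `text` of arbitrary length by recursively
    summarizing halves until everything fits within the context budget."""
    if len(text) <= budget:
        return llm(f"{question}\n\n{text}")  # base case: fits in context
    mid = len(text) // 2
    # Recurse on each half, then answer over the concatenated summaries.
    left = recursive_answer(
        llm, "Summarize the facts relevant to the task.", text[:mid], budget)
    right = recursive_answer(
        llm, "Summarize the facts relevant to the task.", text[mid:], budget)
    return recursive_answer(llm, question, left + "\n" + right, budget)
```

Because each call only ever sees a budget-sized snippet, the input length the scheme can handle is bounded by recursion depth rather than by the model's context window.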