Deep Learning Weekly: Issue 399
The 2025 AI Index Report; RL backlog: OpenAI's many RLs, clarifying distillation, and latent reasoning; a paper on Do LLMs Estimate Uncertainty Well in Instruction-Following?; and many more!
This week in deep learning, we bring you The 2025 AI Index Report, RL backlog: OpenAI's many RLs, clarifying distillation, and latent reasoning, and a paper on Do LLMs Estimate Uncertainty Well in Instruction-Following?
You may also enjoy Llama 4, How Contributing to Open Source Projects Helped Me Build My Dream Career in AI, a paper on PaperBench: Evaluating AI's Ability to Replicate AI Research, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
The 2025 AI Index Report | Stanford HAI
Stanford HAI has released its 2025 AI Index Report, which provides rigorous, objective insights into AI’s technical progress, economic influence, and societal impact.
The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
Meta introduced Llama 4 Scout and Llama 4 Maverick, its first open-weight natively multimodal models, featuring unprecedented context length support and Meta's first use of a mixture-of-experts (MoE) architecture.
New capabilities in Azure AI Foundry to build advanced agentic applications
The Azure team announced new capabilities for building advanced agentic applications within Azure AI Foundry, including an agent framework and an AI Red Teaming Agent.
Taking a responsible path to AGI - Google DeepMind
Google DeepMind explores the frontiers of AGI, prioritizing readiness, proactive risk assessment, and collaboration with the wider AI community.
MLOps & LLMOps
Parsing is Hard: Solving Semantic Understanding with Mistral OCR and Milvus
A blog post showcasing a solution for enhanced document parsing and semantic search using Mistral OCR and Milvus.
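If you want to prototype a similar pipeline yourself, here is a minimal sketch of the indexing-and-search half, using pymilvus (Milvus Lite) and a sentence-transformers embedder. The collection name, file path, and embedding model are illustrative choices, and the OCR output is stubbed with placeholder chunks rather than a real Mistral OCR call.

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

# Illustrative names; swap in your own collection, file, and embedding model.
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = MilvusClient("ocr_demo.db")                 # Milvus Lite, local file

client.create_collection(collection_name="ocr_chunks", dimension=384)

# Stand-ins for chunks produced by the OCR step (e.g. markdown split into passages).
chunks = [
    "Invoice total: $1,240.50, due 2025-05-01.",
    "Section 3.2 describes the warranty terms.",
]
vectors = embedder.encode(chunks).tolist()
client.insert(
    collection_name="ocr_chunks",
    data=[{"id": i, "vector": v, "text": t} for i, (v, t) in enumerate(zip(vectors, chunks))],
)

# Semantic search over the parsed document.
query_vec = embedder.encode(["When is the invoice due?"]).tolist()
hits = client.search(
    collection_name="ocr_chunks",
    data=query_vec,
    limit=2,
    output_fields=["text"],
)
print(hits[0])
```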
Model Context Protocol (MCP) an overview
An explanatory blog post offering a comprehensive overview of the Model Context Protocol, including hosts, clients, and servers.
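To make the host/client/server split concrete, here is a minimal sketch of an MCP server using the official Python SDK's FastMCP helper; the server name, tool, and resource below are made up for illustration.

```python
# Requires the official MCP Python SDK: pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

# A server exposes tools and resources; a host's client connects to it over a transport.
mcp = FastMCP("weekly-demo")  # server name is illustrative

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers (a toy tool the connected LLM host can call)."""
    return a + b

@mcp.resource("notes://{topic}")
def get_note(topic: str) -> str:
    """Expose read-only context the host can pull into a conversation."""
    return f"Placeholder note about {topic}."

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, which local desktop hosts typically use
```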
Optimize Gemma 3 Inference: vLLM on GKE
A technical article exploring the inference optimization of a Gemma 3 model on Google Kubernetes Engine (GKE) using the vLLM library and different NVIDIA GPUs.
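As a rough companion to that article, the snippet below uses vLLM's offline Python API rather than the GKE serving setup; the Gemma 3 model ID and sampling settings are assumptions you would adapt to your hardware, and Gemma 3 support requires a recent vLLM release.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-4b-it",   # assumed Hugging Face ID; pick the size your GPU fits
    tensor_parallel_size=1,          # raise this when sharding across multiple GPUs
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
outputs = llm.generate(
    ["Summarize the trade-offs of mixture-of-experts models in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```

The article itself serves the model behind an OpenAI-compatible endpoint on GKE; the offline API above is just the quickest way to sanity-check throughput settings before deploying.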
Learning
How Contributing to Open Source Projects Helped Me Build My Dream Career in AI
Claire Longo, an AI Researcher and Developer Advocate at Comet, discusses her journey in AI and offers advice on getting involved in the OSS community, building a portfolio, and other ways to kickstart a career in AI.
RL backlog: OpenAI's many RLs, clarifying distillation, and latent reasoning
A blog post discussing various RL applications at OpenAI, providing insights into the role of distillation in model training and exploring the concept of latent reasoning.
Designing for AI Engineers: UI patterns you need to know
A reference article providing UI design principles and practical patterns tailored for AI engineers building tools like chat interfaces and model repositories.
MENTAT: A Clinician-Annotated Benchmark for Complex Psychiatric Decision-Making
An article introducing MENTAT, a new clinician-annotated benchmark designed to evaluate the complex decision-making abilities of language models in realistic psychiatric scenarios.
Libraries & Code
A Model Context Protocol implementation for the Opik platform
An open agentic framework that uses computers like a human
Papers & Publications
Do LLMs Estimate Uncertainty Well in Instruction-Following?
Abstract:
Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with uncertainty stemming from instruction-following, complicating the isolation and comparison across methods and models. To address these issues, we introduce a controlled evaluation setup with two benchmark versions of data, enabling comprehensive comparison of uncertainty estimation methods under various conditions. Our findings show that existing uncertainty methods struggle, particularly when models make subtle errors in instruction following. While internal model states provide some improvement, they remain inadequate in more complex scenarios. The insights from our controlled evaluation setups provide crucial understanding of LLMs’ limitations and potential for uncertainty estimation in instruction-following tasks, paving the way for more trustworthy AI agents.
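For readers who want a concrete baseline to compare against the methods the paper evaluates, the sketch below computes one common uncertainty proxy, the mean log-probability a model assigns to its own response, with Hugging Face transformers. This is a generic illustration, not the paper's benchmark or method; the model and helper function are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with the same API works.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_response_logprob(prompt: str, response: str) -> float:
    """Average log-probability of the response tokens given the prompt:
    a crude confidence proxy; higher usually means lower uncertainty."""
    enc = tok(prompt + response, return_tensors="pt")
    prompt_len = tok(prompt, return_tensors="pt")["input_ids"].shape[1]  # approximate split
    with torch.no_grad():
        logits = model(**enc).logits
    # Shift so position t predicts token t+1.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = enc["input_ids"][:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].mean().item()  # keep only response tokens

print(mean_response_logprob("List three primes: ", "2, 3, 5"))
```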
PaperBench: Evaluating AI's Ability to Replicate AI Research
Abstract:
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline.
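To illustrate the kind of hierarchical rubric scoring the abstract describes, here is a toy sketch in Python; the node structure, weights, and scores are invented for the example and are not OpenAI's grading code.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """A node in a hierarchical rubric: leaves hold judged scores in [0, 1];
    inner nodes aggregate their children by weight."""
    name: str
    weight: float = 1.0
    score: float | None = None              # set on leaves by a judge (human or LLM)
    children: list["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        if not self.children:                # leaf: return the judged score
            return self.score or 0.0
        total_w = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total_w

# Toy rubric for one paper replication.
rubric = RubricNode("replicate-paper", children=[
    RubricNode("codebase", weight=0.4, children=[
        RubricNode("data pipeline", score=1.0),
        RubricNode("training loop", score=0.5),
    ]),
    RubricNode("experiments reproduce headline result", weight=0.6, score=0.0),
])
print(f"replication score: {rubric.aggregate():.1%}")   # 0.4*0.75 + 0.6*0.0 = 30.0%
```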