Deep Learning Weekly: Issue 402
Convergence 2025: GenAI Engineering One Line at a Time, ChatGPT's Sycophancy, a paper on Workshop-Level Automated Scientific Discovery via Agentic Tree Search, and many more!
This week in deep learning, we bring you Convergence 2025: GenAI Engineering One Line at a Time, ChatGPT's Sycophancy, and a paper on The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search.
You may also enjoy Xiaomi's MiMo-7B, Demystifying Verbatim Memorization in Large Language Models, a paper on Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI rolls back ChatGPT sycophancy, explains what went wrong
OpenAI has rolled back a recent update to its GPT-4o model after widespread reports that the system had become excessively flattering and overly agreeable.
Xiaomi releases new MiMo-7B models as DeepSeek upgrades its Prover math AI
Xiaomi released MiMo-7B, a new family of reasoning models that it claims can outperform OpenAI’s o1-mini at some tasks.
Anthropic suggests tweaks to proposed US AI chip export controls
Anthropic agrees with the U.S. government that implementing robust export controls on AI chips will help the U.S. compete in the AI race against China. But the company is suggesting a few tweaks to the proposed restrictions.
Meta announces standalone AI app for personalized assistance
Meta announced a new standalone Meta AI app that houses an AI assistant powered by the company’s Llama 4 model to provide a more personalized experience for users.
MLOps & LLMOps
Convergence 2025: GenAI Engineering One Line at a Time
AI leaders, engineers, and researchers from CrewAI, Meta, Apple, Uber, and more share real-world insights on building and productionizing GenAI systems — from LLM app architecture to infrastructure, agent evaluation, RAG, and more — in this live virtual conference, May 13-14.
Build an automated generative AI solution evaluation pipeline with Amazon Nova
A blog post introducing an automated evaluation framework deployable on AWS.
Learning
We need to know if AI is more rational than humans, not smarter
A blog post that introduces a proof-of-concept benchmark for LLM rationality by adapting the ART-Y assessment.
Red Teaming is a Critical Thinking Exercise: Part 1
An article that presents AI red teaming not merely as technical vulnerability testing for LLMs, but as a critical thinking exercise originating from military and cybersecurity practices.
Demystifying Verbatim Memorization in Large Language Models
A blog post that shows how verbatim memorization is intertwined with a large language model’s general capabilities.
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
An open agentic framework that uses computers like a human.
A fully open, state-of-the-art Mixture-of-Experts model with 1.3 billion active and 6.9 billion total parameters.
Papers & Publications
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Abstract:
AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI-generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large.
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Abstract:
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Abstract:
Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.