Deep Learning Weekly: Issue 390
OpenAI's Operator, How we evaluate AI models and LLMs for GitHub Copilot, a paper on AI Toolkit: Libraries and Essays for Exploring the Technology and Ethics of AI, and many more!
This week in deep learning, we bring you OpenAI's Operator, How we evaluate AI models and LLMs for GitHub Copilot, and AI Toolkit: Libraries and Essays for Exploring the Technology and Ethics of AI.
You may also enjoy SmolVLM Grows Smaller, Which AI to Use Now: An Updated Opinionated Guide, a paper on Chain of Agents: Large Language Models Collaborating on Long-Context Tasks, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing Operator research preview | OpenAI
OpenAI released Operator, an agent that can go to the web to perform tasks for you. Using its own browser, it can look at a webpage and interact with it by typing, clicking, and scrolling.
Introducing Citations on the Anthropic API
Anthropic launched Citations, a new API feature that lets Claude ground its answers in source documents.
SmolVLM Grows Smaller – Introducing the 256M & 500M Models
The Hugging Face team announced two new additions to the SmolVLM family: SmolVLM-256M and SmolVLM-500M.
Bluwhale raises $100M to build a dedicated 'intelligence layer' for decentralized AI agents
The Web3-native AI startup Bluwhale AI raised $100 million in funding to build an “intelligence layer” that spans Layer-1 and Layer-2 blockchain networks.
MLOps & LLMOps
How we evaluate AI models and LLMs for GitHub Copilot
A blog post about the GitHub Copilot team’s experience evaluating AI models, with a focus on offline evaluations.
From RAG to fabric: Lessons learned from building real-world RAGs at GenAIIC
A blog post from AWS about using a router to handle heterogeneous data sources in RAG systems.
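The routing pattern the post describes can be sketched in a few lines. This is a hypothetical illustration, not AWS's implementation: the `classify_source` stub stands in for an LLM call that picks which retriever should handle a query before retrieval runs.

```python
# Hypothetical sketch of a RAG router. A real system would prompt an LLM
# to classify the query; here a keyword stub plays that role.
def classify_source(query: str) -> str:
    """Decide which data source should serve this query (stub classifier)."""
    if any(word in query.lower() for word in ("revenue", "sales", "table")):
        return "structured"
    return "unstructured"

# One retriever per heterogeneous source; both are placeholders here
# (e.g. text-to-SQL for structured data, vector search for documents).
RETRIEVERS = {
    "structured": lambda q: ["SELECT ... rows relevant to: " + q],
    "unstructured": lambda q: ["top-k chunks for: " + q],
}

def routed_retrieve(query: str) -> list[str]:
    """Route the query to exactly one retriever, then retrieve."""
    return RETRIEVERS[classify_source(query)](query)

print(routed_retrieve("What was Q3 revenue?"))
```

The key design choice is that routing happens before retrieval, so each source keeps its own specialized retrieval strategy instead of forcing everything through one vector index.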
Learning
How to align open LLMs in 2025 with DPO & synthetic data
A technical guide that focuses on aligning models using Direct Preference Optimization (DPO).
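For readers new to DPO, its core objective is compact enough to state directly: push the policy to prefer the chosen response over the rejected one by a larger margin than a frozen reference model does. A minimal sketch for a single preference pair (log-probabilities are illustrative values, not from the guide):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid of the
    beta-scaled preference margin, measured relative to the reference model."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss drops below log 2.
print(dpo_loss(-10.0, -14.0, -11.0, -12.0))
```

In practice the log-probabilities come from summing per-token logprobs of each response under the policy and reference models; `beta` controls how far the policy may drift from the reference.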
State of open video generation models in Diffusers
A Hugging Face article about the state of open video generation models and their optimizations in Diffusers.
Demo example - Scheming reasoning evaluations
An article from Apollo Research demonstrating how AI can engage in in-context scheming: when given conflicting goals, a model may act against its developers' intentions.
Which AI to Use Now: An Updated Opinionated Guide
An opinionated guide about choosing a general-purpose AI, from Ethan Mollick.
Libraries & Code
A framework for comprehensive diagnosis and evaluation of agents using simulated, realistic synthetic interactions.
An advanced paper search agent powered by LLMs.
Papers & Publications
AI Toolkit: Libraries and Essays for Exploring the Technology and Ethics of AI
Abstract:
In this paper we describe the development and evaluation of AITK, the Artificial Intelligence Toolkit. This open-source project contains both Python libraries and computational essays (Jupyter notebooks) that together are designed to allow a diverse audience with little or no background in AI to interact with a variety of AI tools, exploring in more depth how they function, visualizing their outcomes, and gaining a better understanding of their ethical implications. These notebooks have been piloted at multiple institutions in a variety of humanities courses centered on the theme of responsible AI. In addition, we conducted usability testing of AITK. Our pilot studies and usability testing results indicate that AITK is easy to navigate and effective at helping users gain a better understanding of AI. Our goal, in this time of rapid innovations in AI, is for AITK to provide an accessible resource for faculty from any discipline looking to incorporate AI topics into their courses and for anyone eager to learn more about AI on their own.
Training Large Language Models to Reason in a Continuous Latent Space
Abstract:
Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
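The core loop of Coconut is simple to sketch: instead of decoding the last hidden state into a token, feed it straight back in as the next input embedding. Below is a toy illustration of that control flow, with a one-layer stand-in for the LLM (everything here is an assumed simplification; the paper uses a full transformer and trains it for this regime).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                              # embedding / hidden dimension
W = rng.normal(size=(D, D)) * 0.1  # toy "LLM" weights

def toy_llm(embeddings: np.ndarray) -> np.ndarray:
    """Stand-in for a transformer: one hidden state per input position."""
    return np.tanh(embeddings @ W)

def continuous_thoughts(prompt_embeddings: np.ndarray, n_thoughts: int) -> np.ndarray:
    """Coconut-style loop: reinject the last hidden state as the next
    input embedding instead of decoding it into a word token."""
    seq = prompt_embeddings
    for _ in range(n_thoughts):
        hidden = toy_llm(seq)
        last_hidden = hidden[-1]             # the "continuous thought"
        seq = np.vstack([seq, last_hidden])  # back into the input, no decoding
    return seq

prompt = rng.normal(size=(3, D))
out = continuous_thoughts(prompt, n_thoughts=4)
print(out.shape)  # 3 prompt positions + 4 continuous thoughts
```

Because the continuous thought never collapses to a single token, it can keep several candidate reasoning steps "in superposition", which is what enables the BFS-like behavior the abstract describes.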
Chain of Agents: Large Language Models Collaborating on Long-Context Tasks
Abstract:
Addressing the challenge of effectively processing long contexts has become a critical issue for Large Language Models (LLMs). Two common strategies have emerged: 1) reducing the input length, such as retrieving relevant chunks by Retrieval-Augmented Generation (RAG), and 2) expanding the context window limit of LLMs. However, both strategies have drawbacks: input reduction has no guarantee of covering the part with needed information, while window extension struggles with focusing on the pertinent information for solving the task. To mitigate these limitations, we propose Chain-of-Agents (CoA), a novel framework that harnesses multi-agent collaboration through natural language to enable information aggregation and context reasoning across various LLMs over long-context tasks. CoA consists of multiple worker agents who sequentially communicate to handle different segmented portions of the text, followed by a manager agent who synthesizes these contributions into a coherent final output. CoA processes the entire input by interleaving reading and reasoning, and it mitigates long context focus issues by assigning each agent a short context. We perform a comprehensive evaluation of CoA on a wide range of long-context tasks in question answering, summarization, and code completion, demonstrating significant improvements by up to 10% over strong baselines of RAG, Full-Context, and multi-agent LLMs.
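The worker/manager pipeline from the abstract can be sketched as a short control loop. This is a hypothetical reconstruction, not the authors' code: the `llm` stub stands in for real model calls, and the prompt wording is invented for illustration.

```python
def llm(prompt: str) -> str:
    """Stub for an LLM call; a real implementation would query a model."""
    return prompt[-200:]  # echo a truncated "summary" for demonstration

def chain_of_agents(document: str, query: str, chunk_size: int = 500) -> str:
    """Chain-of-Agents control flow: worker agents read segments in
    sequence, passing accumulated notes forward; a manager agent
    synthesizes the final answer from the last worker's notes."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    notes = ""  # the communication unit passed worker -> worker
    for chunk in chunks:
        notes = llm(f"Previous notes: {notes}\nSegment: {chunk}\n"
                    f"Query: {query}\nUpdate the notes with relevant evidence.")
    return llm(f"Notes: {notes}\nQuery: {query}\nWrite the final answer.")

answer = chain_of_agents("some long document " * 100, "What is discussed?")
```

Each worker only ever sees one short segment plus the running notes, which is how the framework sidesteps the long-context focus problem without extending any single model's window.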