Deep Learning Weekly: Issue 386
IBM open sources new AI models for materials discovery, Unified Pure Vision Agents for Autonomous GUI Interaction, Momentum Approximation in Asynchronous Private Federated Learning, and much more!
This week in deep learning, we bring you IBM open sources new AI models for materials discovery, Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction and a paper on Momentum Approximation in Asynchronous Private Federated Learning.
You may also enjoy DeepSeek-V3 outperforms Llama and Qwen on launch, Inductive biases of neural network modularity in spatial navigation, a paper on Large Concept Models: Language Modeling in a Sentence Representation Space, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
IBM open sources new AI models for materials discovery
IBM open-sourced new AI models to accelerate materials discovery with applications in chip fabrication, clean energy, and consumer packaging.
A collection of AI predictions made in 2024 about advancements in AI capabilities, safety, and societal impact, with a focus on specific and testable predictions.
DeepSeek-V3, ultra-large open-source AI, outperforms Llama and Qwen on launch
Chinese AI startup DeepSeek, known for challenging leading AI vendors with its innovative open-source technologies, released a new ultra-large model: DeepSeek-V3.
ByteDance plans to spend $7B on cloud-based GPUs this year to fuel its AI ambitions
ByteDance reportedly has a plan to get around tough U.S. restrictions on the export of advanced computer chips to China.
AI-powered mineral exploration company KoBold Metals raises $527M
KoBold Metals, a California-based startup that specializes in using AI to discover new deposits of metals critical for batteries and renewable energy, has raised $527 million in equity funding.
MLOps & LLMOps
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
An article about AGUVIS, a unified pure vision-based framework for autonomous GUI agents.
Configuring Azure OpenAI with CrewAI: A Comprehensive Guide
A step-by-step guide to set up and configure Azure OpenAI within the CrewAI framework.
Learning
Inductive biases of neural network modularity in spatial navigation
A research blog post about how modular neural network architectures inspired by the human brain can improve learning and generalization in spatial navigation tasks.
Fine-tune classifier with ModernBERT in 2025
A blog post that demonstrates how to fine-tune ModernBERT, a new state-of-the-art encoder model, for classifying user prompts to implement an intelligent LLM router.
QwQ: Reflect Deeply on the Boundaries of the Unknown
A blog post about QwQ, a large language model from the Qwen Team that specializes in math and coding.
Maximum Likelihood Estimation and Loss Functions
A blog post about the connection between maximum likelihood estimation and loss functions in machine learning.
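The connection the post covers can be shown in a few lines: under a Gaussian noise model with known scale, the negative log-likelihood of the data equals mean squared error up to a positive scale and an additive constant, so the two objectives share a minimizer. A minimal sketch (toy data, assumed noise scale of 1.0):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.1, size=100)

# Mean squared error between predictions and targets.
mse = np.mean((y_true - y_pred) ** 2)

# Gaussian negative log-likelihood with an assumed (known) noise scale.
sigma = 1.0
nll = np.mean(0.5 * ((y_true - y_pred) / sigma) ** 2
              + 0.5 * np.log(2 * np.pi * sigma**2))

# nll == 0.5 * mse + constant, so minimizing MSE is maximizing likelihood.
```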
Superposition: What Makes it Difficult to Explain Neural Network
A blog post about superposition, a phenomenon in neural networks that makes model explainability challenging.
Libraries & Code
A barebones library for agents. Agents write Python code to call tools and orchestrate other agents.
A high-performance RLHF framework built on Ray, DeepSpeed, and HF Transformers.
Papers & Publications
Momentum Approximation in Asynchronous Private Federated Learning
Abstract:
Asynchronous protocols have been shown to improve the scalability of federated learning (FL) with a massive number of clients. Meanwhile, momentum-based methods can achieve the best model quality in synchronous FL. However, naively applying momentum in asynchronous FL algorithms leads to slower convergence and degraded model performance. It is still unclear how to effectively combine these two techniques to achieve the benefits of both. In this paper, we find that asynchrony introduces implicit bias to momentum updates. To address this problem, we propose momentum approximation, which minimizes the bias by finding an optimal weighted average of all historical model updates. Momentum approximation is compatible with secure aggregation as well as differential privacy, and can be easily integrated into production FL systems with minor communication and storage costs. We empirically demonstrate that on benchmark FL datasets, momentum approximation can achieve a 1.15–4× speedup in convergence compared to existing asynchronous FL optimizers with momentum.
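The core idea can be sketched in a few lines: rather than folding each (possibly stale) asynchronous update into a momentum buffer naively, approximate the buffer as a weighted average over all historical model updates. This is an illustrative sketch only; the weights below are made up, whereas the paper solves for an optimal weighting that minimizes staleness bias.

```python
import numpy as np

def naive_momentum_step(velocity, update, beta=0.9):
    # Standard momentum: v <- beta * v + update. Under asynchrony the
    # incoming `update` may come from a stale model, which biases v.
    return beta * velocity + update

def momentum_approximation(history, weights):
    """Approximate the momentum buffer as a weighted average of all
    historical model updates. The weights would be chosen to minimize
    the bias introduced by client staleness; here they are given."""
    history = np.asarray(history)           # shape: (t, dim)
    weights = np.asarray(weights)[:, None]  # shape: (t, 1)
    return (weights * history).sum(axis=0)

# Toy example: three historical updates in a 2-D parameter space.
updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
w = [0.25, 0.25, 0.5]  # hypothetical weights, not the paper's solution
approx = momentum_approximation(updates, w)
```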
Large Concept Models: Language Modeling in a Sentence Representation Space
Abstract:
LLMs have revolutionized the field of artificial intelligence and have emerged as the de facto tool for many tasks. Current LLMs process input and generate output at the token level. This is in sharp contrast to humans, who operate at multiple levels of abstraction, well beyond single words, to analyze information and generate creative content. In this paper, we present an attempt at an architecture that operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher-level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as a proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities.
The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data on the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size.
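The MSE-regression variant reduces to a simple idea: regress the embedding of the next sentence from the embeddings of the sentences so far. A toy sketch under stated assumptions: the encoder, embedding dimension, and linear predictor below are stand-ins (the paper uses SONAR embeddings and a Transformer, not a mean-plus-linear-map model).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a sentence encoder; in the paper this role is
# played by SONAR. The dimension here is made up for illustration.
EMB_DIM = 8

def predict_next_embedding(context, W):
    # Minimal autoregressive predictor: average the context sentence
    # embeddings and apply a linear map. A real LCM uses a Transformer.
    return np.mean(context, axis=0) @ W

def mse_loss(pred, target):
    # The MSE-regression variant trains the model to regress the
    # next sentence's embedding directly.
    return float(np.mean((pred - target) ** 2))

# A "document" of three sentence embeddings.
doc = rng.normal(size=(3, EMB_DIM))
W = rng.normal(size=(EMB_DIM, EMB_DIM)) * 0.1

pred = predict_next_embedding(doc[:2], W)  # predict sentence 3 from 1-2
loss = mse_loss(pred, doc[2])
```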
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Abstract:
The breakthrough of OpenAI o1 highlights the potential of enhanced reasoning to improve LLMs. Yet most research on reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike reasoning in mathematics. To address this, we propose verifiable medical problems with a medical verifier that checks the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for complex reasoning trajectories for fine-tuning LLMs, and (2) applying reinforcement learning (RL) with verifier-based rewards to further enhance complex reasoning. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show that complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.
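The two-stage recipe hinges on a programmatic verifier. A minimal sketch of that idea, assuming an exact-match verifier and toy trajectories (the paper's verifier, sampling, and RL machinery are far richer):

```python
def verify(answer: str, gold: str) -> bool:
    # Stand-in verifier: a problem is "verifiable" when its final
    # answer can be checked programmatically against a reference.
    return answer.strip().lower() == gold.strip().lower()

def verifier_reward(trajectory: list[str], gold: str) -> float:
    # Stage 2: score a sampled reasoning trajectory by its final
    # answer, yielding a reward signal for RL.
    return 1.0 if verify(trajectory[-1], gold) else 0.0

def filter_trajectories(samples: list[list[str]], gold: str) -> list[list[str]]:
    # Stage 1: keep only trajectories the verifier accepts; these
    # become fine-tuning data for complex reasoning.
    return [t for t in samples if verifier_reward(t, gold) == 1.0]

# Hypothetical sampled trajectories for one verifiable problem.
samples = [
    ["step: consider differential diagnosis", "Answer: pneumonia"],
    ["step: rule out infection", "Answer: asthma"],
]
kept = filter_trajectories(samples, gold="Answer: pneumonia")
```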